Problem running on AWS EC2 DL AMI - Fail to find the dnn implementation

keliji3451 · January 21, 2021, 8:57am

I have set up training on an AWS Deep Learning AMI in a vanilla Python 3.6 virtual environment. nvcc --version reports CUDA 10.0, and the installed cuDNN library is 7.5. I did not use a conda env, as I read here that this is not encouraged.

When I start my training with

export TF_FORCE_GPU_ALLOW_GROWTH=true

python3 -u /home/ubuntu/deepspeech/DeepSpeech/DeepSpeech.py \
    --train_files "/home/ubuntu/deepspeech/train.csv" \
    --dev_files "/home/ubuntu/deepspeech/dev.csv" \
    --test_files "/home/ubuntu/deepspeech/test.csv" \
    --scorer "/home/ubuntu/deepspeech/kenlm.scorer" \
    --alphabet_config_path "/home/ubuntu/deepspeech/alphabet.txt" \
    --train_batch_size 16 \
    --dev_batch_size 16 \
    --test_batch_size 4 \
    --learning_rate 0.0001 \
    --dropout_rate 0.3 \
    --epochs 15 \
    --train_cudnn True \
    --use_allow_growth True \
    --automatic_mixed_precision True

I get the error "Fail to find the dnn implementation.":

I Enabling automatic mixed precision training.
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
Traceback (most recent call last):
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
[[{{node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/deepspeech/DeepSpeech/DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/train.py", line 976, in run_script
    absl.app.run(main)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/train.py", line 948, in main
    train()
  File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/train.py", line 527, in train
    load_or_init_graph_for_training(session)
  File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 137, in load_or_init_graph_for_training
    _load_or_init_impl(session, methods, allow_drop_layers=True)
  File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 112, in _load_or_init_impl
    return _initialize_all_variables(session)
  File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 88, in _initialize_all_variables
    session.run(v.initializer)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
[[node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at /venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Any ideas on how to solve or debug this would be appreciated.

lissyx · January 21, 2021, 9:29am

Have you properly searched the documentation?

https://deepspeech.readthedocs.io/en/latest/TRAINING.html?highlight=cudnn#prerequisites-for-training-a-model

keliji3451 · January 21, 2021, 9:42am

Thanks for pointing that out. I had a look at the link to the TF documentation, and it states 7.4 for TF 1.15:

https://www.tensorflow.org/install/source#gpu

Is there any simple or recommended way to upgrade to CuDNN 7.6?

lissyx · January 21, 2021, 10:14am

Sorry, but read the link I shared: it's 7.6. We can't help with AWS specifics. Please investigate the AMI; I guess it already provides a tensorflow-gpu package. It needs to be 1.15. If that's the case, use DS_NOTENSORFLOW=y when running the DeepSpeech install. Search the repo for example usage of that. It will disable our install of the TensorFlow dependency, so it should use the one provided by your AMI.
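
For anyone debugging the same error: a quick way to see which cuDNN the dynamic loader actually resolves inside the venv is to load it through ctypes and ask it for its version. This is only a sketch, not part of DeepSpeech. It assumes the library is exposed under the soname libcudnn.so.7 and that the version number follows the cuDNN 7.x encoding (major*1000 + minor*100 + patch); the helper names cudnn_version and decode_cudnn_version are made up for illustration.

```python
import ctypes


def decode_cudnn_version(v):
    """Split a cuDNN 7.x version integer into (major, minor, patch).

    Assumes the 7.x encoding: major*1000 + minor*100 + patch.
    """
    return v // 1000, (v % 1000) // 100, v % 100


def cudnn_version(lib_name="libcudnn.so.7"):
    """Return (major, minor, patch) of the cuDNN the loader finds, or None.

    lib_name is an assumption; adjust it if your AMI ships a different soname.
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        # The loader could not find (or could not load) the library.
        return None
    # cudnnGetVersion() is part of the cuDNN C API and returns a size_t.
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return decode_cudnn_version(lib.cudnnGetVersion())
```

Calling cudnn_version() from the same shell and venv you train in should report a 7.6.x tuple once the right library is picked up; None means the loader did not find a library with that name at all.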

keliji3451 · January 21, 2021, 11:56am

Thanks @lissyx. Solved it now.

You have to use cuDNN 7.6, and you do that by:

1. Downloading the two packages for 7.6.5.31 from NVIDIA.
2. Installing them as described in the docs.
3. Adding an export in .dlamirc or your startup script that puts /usr/lib/x86_64-linux-gnu in front of the other entries in the library path (LD_LIBRARY_PATH), so this cuDNN version is found before the other one.
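
The ordering point can be illustrated with a small sketch. It mimics, in simplified form, how a colon-separated library path is searched left to right, so the first directory holding a matching file wins. It ignores everything else the real loader consults (rpath, ld.so.cache), and the /usr/local/cuda/lib64 entry in the usage note is a hypothetical stand-in for whatever the AMI puts on the path. The function names are invented for this example.

```python
import os


def prepend_dir(directory, search_path):
    """Return search_path with directory moved (or added) to the front."""
    parts = [p for p in search_path.split(":") if p and p != directory]
    return ":".join([directory] + parts)


def first_dir_with(lib_name, search_path):
    """Return the first directory in search_path that contains lib_name.

    Simplified model of a left-to-right search; returns None if no
    directory on the path holds a file with that name.
    """
    for d in search_path.split(":"):
        if d and os.path.isfile(os.path.join(d, lib_name)):
            return d
    return None
```

For example, prepend_dir("/usr/lib/x86_64-linux-gnu", "/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu") gives "/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64", and first_dir_with("libcudnn.so.7", os.environ.get("LD_LIBRARY_PATH", "")) shows which directory on the current path would supply cuDNN first. The variable has to be set before Python starts; changing it from inside a running process does not affect how that process loads libraries.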

lissyx · January 21, 2021, 12:29pm

Good to have it documented for others, thanks.