Problem running on AWS EC2 DL AMI - Fail to find the dnn implementation

I have set up training on an AWS Deep Learning AMI with a vanilla Python 3.6 virtual environment. nvcc --version reports CUDA 10.0, and the installed cuDNN library is 7.5. I did not use a conda env, as I read here that this is not encouraged.

When I start my training with


python3 -u /home/ubuntu/deepspeech/DeepSpeech/ \
    --train_files "/home/ubuntu/deepspeech/train.csv" \
    --dev_files "/home/ubuntu/deepspeech/dev.csv" \
    --test_files "/home/ubuntu/deepspeech/test.csv" \
    --scorer "/home/ubuntu/deepspeech/kenlm.scorer" \
    --alphabet_config_path "/home/ubuntu/deepspeech/alphabet.txt" \
    --train_batch_size 16 \
    --dev_batch_size 16 \
    --test_batch_size 4 \
    --learning_rate 0.0001 \
    --dropout_rate 0.3 \
    --epochs 15 \
    --train_cudnn True \
    --use_allow_growth True \
    --automatic_mixed_precision True

I get the "Fail to find the dnn implementation" error:

I Enabling automatic mixed precision training.
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
Traceback (most recent call last):
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/", line 1365, in _do_call
    return fn(*args)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/", line 1443, in _call_tf_sessionrun
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
[[{{node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/deepspeech/DeepSpeech/", line 12, in
  File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/", line 976, in run_script
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/absl/", line 303, in run
    _run_main(main, args)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/absl/", line 251, in _run_main
  File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/", line 948, in main
  File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/", line 527, in train
  File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/util/", line 137, in load_or_init_graph_for_training
    _load_or_init_impl(session, methods, allow_drop_layers=True)
  File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/util/", line 112, in _load_or_init_impl
    return _initialize_all_variables(session)
  File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/util/", line 88, in _initialize_all_variables
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/", line 956, in run
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/", line 1359, in _do_run
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
[[node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at /venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ ]]

Any ideas on how to solve or debug this would be appreciated :)

Have you properly searched the documentation?

Thanks for pointing that out. I had a look at the link to the TF documentation, and it states 7.4 for TF 1.15.

Is there any simple or recommended way to upgrade to CuDNN 7.6?
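Before upgrading, it can help to confirm which cuDNN version is actually installed. The following is a minimal sketch, assuming a cuDNN 7.x header that declares the version through the CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL macros. The two header paths and the function name cudnn_version_from_header are assumptions for illustration, not something taken from the AMI documentation.

```shell
#!/usr/bin/env bash
# Print MAJOR.MINOR.PATCH as declared in a cuDNN 7.x header file.
# (Hypothetical helper name, used only for this sketch.)
cudnn_version_from_header() {
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END { if (major != "") print major "." minor "." patch }
    ' "$1"
}

# Assumed header locations; adjust them for your AMI.
for header in /usr/include/cudnn.h /usr/local/cuda/include/cudnn.h; do
    if [ -f "$header" ]; then
        echo "$header: cuDNN $(cudnn_version_from_header "$header")"
    fi
done
```

Note that the header only reflects the development package. If several cuDNN copies are installed, the library that TensorFlow loads at runtime may be a different one.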

Sorry, but read the link I shared: it's 7.6. We can't help with AWS specifics. Please investigate the AMI; I guess it already provides a tensorflow-gpu package. It needs to be 1.15. If that's the case, use DS_NOTENSORFLOW=y when running the DeepSpeech install. Search the repo for example usage of that. It will disable our install of the TensorFlow dependency, so it should use the one provided by your AMI.
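A rough sketch of that suggestion follows. It assumes the training code is installed with pip3 from the DeepSpeech checkout, and that the AMI's tensorflow-gpu package is visible to the pip3 you are using (a plain virtual environment will not see system packages by default). The function names and the exact install command are assumptions for illustration; only DS_NOTENSORFLOW=y comes from the post above.

```shell
#!/usr/bin/env bash
# Return success if the given version string is a 1.15.x release.
is_tf_115() {
    case "$1" in
        1.15|1.15.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the version of tensorflow-gpu known to pip3 (empty if not installed).
installed_tf_version() {
    pip3 show tensorflow-gpu 2>/dev/null | awk '/^Version:/ {print $2}'
}

# Hypothetical helper: run it from the DeepSpeech checkout.
install_deepspeech_training() {
    local tf_version
    tf_version="$(installed_tf_version)"
    if is_tf_115 "$tf_version"; then
        # Assumed install command; DS_NOTENSORFLOW=y skips the TensorFlow dependency.
        DS_NOTENSORFLOW=y pip3 install --upgrade -e .
    else
        echo "tensorflow-gpu 1.15 not found (got: '${tf_version:-none}')" >&2
        return 1
    fi
}
```

Nothing is installed until install_deepspeech_training is called explicitly, so the file can be sourced safely to try the version check first.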

Thanks @lissyx. Solved it now.

You have to use cuDNN 7.6, and you do that by:

  • downloading the 2 cuDNN packages from NVIDIA

  • installing them as described in the docs

  • adding /usr/lib/x86_64-linux-gnu in front of the other entries in LD_LIBRARY_PATH, in .dlamirc or your startup script, so that this cuDNN version is found before the other one
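The last step could look roughly like the sketch below, which could go into .dlamirc or a startup script. The directory is the one named above; the function name prepend_ld_path is hypothetical, and the guard simply avoids adding the directory again when it is already the first entry.

```shell
#!/usr/bin/env bash
# Directory named in the steps above, where the newer cuDNN was installed.
CUDNN_DIR="/usr/lib/x86_64-linux-gnu"

# Prepend a directory to LD_LIBRARY_PATH unless it is already the first entry.
# (Hypothetical helper name, used only for this sketch.)
prepend_ld_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        ":$1:"*) ;;  # already first, nothing to do
        *) export LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
}

prepend_ld_path "$CUDNN_DIR"
```

The dynamic linker searches LD_LIBRARY_PATH entries from left to right, which is why the directory has to come before the AMI's own CUDA paths.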

Good to have it documented for others, thanks.