I have set up training on an AWS Deep Learning AMI with a vanilla Python 3.6 virtual environment. nvcc --version reports CUDA 10.0, and the installed cuDNN library is version 7.5. I did not use a conda env, as I read here that this is not encouraged.
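For anyone who wants to double-check the cuDNN version mentioned above, here is a minimal sketch that reads the version macros from the cuDNN header. The header path /usr/local/cuda/include/cudnn.h and the helper name parse_cudnn_version are assumptions for illustration, not something taken from the AMI documentation, and the header may live elsewhere on a given machine.

```python
import re
from pathlib import Path

# Assumed location of the cuDNN 7.x header; adjust to the actual install.
CUDNN_HEADER = Path("/usr/local/cuda/include/cudnn.h")


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) parsed from cuDNN header text, or None.

    Hypothetical helper: it looks for the CUDNN_MAJOR, CUDNN_MINOR and
    CUDNN_PATCHLEVEL #define lines and returns None if any is missing.
    """
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(
            r"^\s*#define\s+" + name + r"\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    if CUDNN_HEADER.exists():
        text = CUDNN_HEADER.read_text(errors="ignore")
        print("cuDNN header version:", parse_cudnn_version(text))
    else:
        print("No cuDNN header found at", CUDNN_HEADER)
```

This only reports what the header on disk declares. It does not show which cuDNN library TensorFlow actually loads at runtime, so the two can still differ.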
When I start my training with
export TF_FORCE_GPU_ALLOW_GROWTH=true
python3 -u /home/ubuntu/deepspeech/DeepSpeech/DeepSpeech.py \
  --train_files "/home/ubuntu/deepspeech/train.csv" \
  --dev_files "/home/ubuntu/deepspeech/dev.csv" \
  --test_files "/home/ubuntu/deepspeech/test.csv" \
  --scorer "/home/ubuntu/deepspeech/kenlm.scorer" \
  --alphabet_config_path "/home/ubuntu/deepspeech/alphabet.txt" \
  --train_batch_size 16 \
  --dev_batch_size 16 \
  --test_batch_size 4 \
  --learning_rate 0.0001 \
  --dropout_rate 0.3 \
  --epochs 15 \
  --train_cudnn True \
  --use_allow_growth True \
  --automatic_mixed_precision True
I get the "Fail to find the dnn implementation." error:
I Enabling automatic mixed precision training.
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
Traceback (most recent call last):
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
  [[{{node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams}}]]

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/ubuntu/deepspeech/DeepSpeech/DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/train.py", line 976, in run_script
    absl.app.run(main)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/train.py", line 948, in main
    train()
  File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/train.py", line 527, in train
    load_or_init_graph_for_training(session)
  File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 137, in load_or_init_graph_for_training
    _load_or_init_impl(session, methods, allow_drop_layers=True)
  File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 112, in _load_or_init_impl
    return _initialize_all_variables(session)
  File "/home/ubuntu/deepspeech/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 88, in _initialize_all_variables
    session.run(v.initializer)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/ubuntu/deepspeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
  [[node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at /venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Any ideas on how to solve or debug this would be appreciated.