Training DeepSpeech on gpu failed

Seham_Nasr · July 19, 2021, 11:00pm

I am trying to train a DeepSpeech model by following the steps in the "train your own model" documentation and the DeepSpeech playbook. I have also read the GitHub issues that report this problem.
I used the following environment:
Nvidia RTX 2070 with 8 GB dedicated memory
Ubuntu 18.04
CUDA 10.0 / cuDNN 7.6.5
tensorflow-gpu 1.15.4

nvidia-smi

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1224      G   /usr/lib/xorg/Xorg                 27MiB |
|    0   N/A  N/A      1334      G   /usr/bin/gnome-shell               69MiB |
|    0   N/A  N/A      1551      G   /usr/lib/xorg/Xorg                173MiB |
|    0   N/A  N/A      1679      G   /usr/bin/gnome-shell               29MiB |
|    0   N/A  N/A      2034      G   /usr/lib/firefox/firefox           12MiB |
|    0   N/A  N/A      2656      G   …AAAAAAAAA= --shared-files         70MiB |
+-----------------------------------------------------------------------------+

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
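
Note that `nvcc -V` only reports the installed toolkit version; it does not show whether the CUDA runtime libraries can actually be loaded from the Python process that runs the training. A minimal sketch of such a check follows. The library file names are an assumption based on the CUDA 10.0 / cuDNN 7 versions listed above, and `check_cuda_libs` is a hypothetical helper, not part of DeepSpeech or TensorFlow.

```python
import ctypes

# Assumed shared-library names for CUDA 10.0 and cuDNN 7 (adjust to your install).
CUDA_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcudnn.so.7",
]


def check_cuda_libs(names=CUDA_LIBS):
    """Return {library name: True if the dynamic loader can open it}."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # uses the same loader search path as the Python process
            results[name] = True
        except OSError:
            results[name] = False
    return results


if __name__ == "__main__":
    for lib, ok in check_cuda_libs().items():
        print(("found    " if ok else "MISSING  ") + lib)
```

If a library is reported as missing, one common cause is that its directory is not on `LD_LIBRARY_PATH` in the shell that launches the training.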

Training runs correctly on the CPU, but when I add the flag

--train_cudnn True

the following error is raised:

I Enabling automatic mixed precision training.
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
Traceback (most recent call last):
File “/home/seham/tmp/deepspeech-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/home/seham/tmp/deepspeech-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1348, in _run_fn
self._extend_graph()
File “/home/seham/tmp/deepspeech-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1388, in _extend_graph
tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op ‘CudnnRNNCanonicalToParams’ used by {{node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams}}with these attrs: [dropout=0, seed=4568, num_params=8, input_mode=“linear_input”, T=DT_FLOAT, direction=“unidirectional”, rnn_mode=“lstm”, seed2=257]
Registered devices: [CPU, XLA_CPU]
Registered kernels:

 [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “DeepSpeech.py”, line 12, in <module>
ds_train.run_script()
File “/home/seham/DeepSpeech/training/deepspeech_training/train.py”, line 982, in run_script
absl.app.run(main)
File “/home/seham/tmp/deepspeech-train-venv/lib/python3.6/site-packages/absl/app.py”, line 312, in run
_run_main(main, args)
File “/home/seham/tmp/deepspeech-train-venv/lib/python3.6/site-packages/absl/app.py”, line 258, in _run_main
sys.exit(main(argv))
File “/home/seham/DeepSpeech/training/deepspeech_training/train.py”, line 954, in main
train()
File “/home/seham/DeepSpeech/training/deepspeech_training/train.py”, line 529, in train
load_or_init_graph_for_training(session)
File “/home/seham/DeepSpeech/training/deepspeech_training/util/checkpoints.py”, line 137, in load_or_init_graph_for_training
_load_or_init_impl(session, methods, allow_drop_layers=True)
File “/home/seham/DeepSpeech/training/deepspeech_training/util/checkpoints.py”, line 112, in _load_or_init_impl
return _initialize_all_variables(session)
File “/home/seham/DeepSpeech/training/deepspeech_training/util/checkpoints.py”, line 88, in _initialize_all_variables
session.run(v.initializer)
File “/home/seham/tmp/deepspeech-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/home/seham/tmp/deepspeech-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/home/seham/tmp/deepspeech-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/home/seham/tmp/deepspeech-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op ‘CudnnRNNCanonicalToParams’ used by node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at /home/seham/tmp/deepspeech-train-venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [dropout=0, seed=4568, num_params=8, input_mode=“linear_input”, T=DT_FLOAT, direction=“unidirectional”, rnn_mode=“lstm”, seed2=257]
Registered devices: [CPU, XLA_CPU]
Registered kernels:

 [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]
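
One clue in the error above is the line `Registered devices: [CPU, XLA_CPU]`: no GPU device appears in it, which suggests TensorFlow did not register the GPU at all, so the cuDNN op has no device to run on. A minimal sketch for pulling that line out of a saved log, assuming the message format shown above (`registered_devices` and `gpu_registered` are hypothetical helper names):

```python
import re


def registered_devices(error_text):
    """Extract device types from a 'Registered devices: [...]' line."""
    match = re.search(r"Registered devices:\s*\[([^\]]*)\]", error_text)
    if match is None:
        return []
    return [d.strip() for d in match.group(1).split(",") if d.strip()]


def gpu_registered(error_text):
    """True if any registered device type is a GPU (e.g. GPU, XLA_GPU)."""
    return any(d.endswith("GPU") for d in registered_devices(error_text))


message = "Registered devices: [CPU, XLA_CPU]"
print(registered_devices(message))  # ['CPU', 'XLA_CPU']
print(gpu_registered(message))      # False
```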

I have seen that many people solved this error by repairing their environment settings, but I have tried this many times with CUDA 10.0 and CUDA 10.1, and even with CUDA 11.0. (The same issue also occurs with the Docker image described in the playbook.)

pip list

Package Version Location

absl-py 0.13.0
alembic 1.6.5
appdirs 1.4.4
astor 0.8.1
attrdict 2.0.1
attrs 21.2.0
audioread 2.1.9
beautifulsoup4 4.9.3
bs4 0.0.1
cached-property 1.5.2
certifi 2021.5.30
cffi 1.14.6
charset-normalizer 2.0.3
cliff 3.8.0
cmaes 0.8.2
cmd2 2.1.2
colorama 0.4.4
colorlog 5.0.1
dataclasses 0.8
decorator 5.0.9
deepspeech-training 0.9.3 /home/seham/DeepSpeech/training
ds-ctcdecoder 0.9.3
gast 0.2.2
google-pasta 0.2.0
greenlet 1.1.0
grpcio 1.38.1
h5py 3.1.0
idna 3.2
importlib-metadata 4.6.1
joblib 1.0.1
Keras-Applications 1.0.8
Keras-Preprocessing 1.1.2
librosa 0.8.1
llvmlite 0.31.0
Mako 1.1.4
Markdown 3.3.4
MarkupSafe 2.0.1
numba 0.47.0
numpy 1.18.5
opt-einsum 3.3.0
optuna 2.8.0
opuslib 2.0.0
packaging 21.0
pandas 1.1.5
pbr 5.6.0
pip 21.1.3
pkg-resources 0.0.0
pooch 1.4.0
prettytable 2.1.0
progressbar2 3.53.1
protobuf 3.17.3
pycparser 2.20
pyparsing 2.4.7
pyperclip 1.8.2
python-dateutil 2.8.2
python-editor 1.0.4
python-utils 2.5.6
pytz 2021.1
pyxdg 0.27
PyYAML 5.4.1
requests 2.26.0
resampy 0.2.2
scikit-learn 0.24.2
scipy 1.5.4
semver 2.13.0
setuptools 49.6.0
six 1.16.0
SoundFile 0.10.3.post1
soupsieve 2.2.1
sox 1.4.1
SQLAlchemy 1.4.21
stevedore 3.3.0
tensorboard 1.15.0
tensorflow 1.15.4
tensorflow-estimator 1.15.1
tensorflow-gpu 1.15.4
termcolor 1.1.0
threadpoolctl 2.2.0
tqdm 4.61.2
typing-extensions 3.10.0.0
urllib3 1.26.6
wcwidth 0.2.5
Werkzeug 2.0.1
wheel 0.34.2
wrapt 1.12.1
zipp 3.5.0
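
The list above contains both `tensorflow 1.15.4` and `tensorflow-gpu 1.15.4`. With TensorFlow 1.x, the CPU-only and GPU packages install the same Python module, so having both in one virtual environment can leave the CPU-only build active. This is a possible cause worth ruling out, not a confirmed diagnosis. A minimal sketch that flags the situation from `pip list` text (both function names are hypothetical):

```python
def tensorflow_packages(pip_list_text):
    """Map installed TensorFlow package names to versions from `pip list` text."""
    found = {}
    for line in pip_list_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].lower() in ("tensorflow", "tensorflow-gpu"):
            found[parts[0].lower()] = parts[1]
    return found


def has_cpu_gpu_conflict(pip_list_text):
    """True if both the CPU-only and the GPU TensorFlow packages are installed."""
    packages = tensorflow_packages(pip_list_text)
    return "tensorflow" in packages and "tensorflow-gpu" in packages


sample = """tensorboard 1.15.0
tensorflow 1.15.4
tensorflow-estimator 1.15.1
tensorflow-gpu 1.15.4"""
print(tensorflow_packages(sample))
print(has_cpu_gpu_conflict(sample))  # True
```

If the check returns True, one option is to uninstall both packages and reinstall only `tensorflow-gpu==1.15.4` in a fresh virtual environment.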

Any help is much appreciated.

othiele · July 20, 2021, 6:52am

Ask the Coqui guys, as they currently do a lot of the training. Read more in this post.

Seham_Nasr · July 20, 2021, 10:54am

Is there anything that I should edit in the configuration code?

Seham_Nasr · July 20, 2021, 11:16pm

This is seriously unbelievable, because cuDNN and CUDA have been tested and they are working fine!
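
A working CUDA and cuDNN install does not by itself mean that the TensorFlow build inside the virtual environment can see the GPU. A minimal sketch of a check from inside the training venv, using TensorFlow's `device_lib.list_local_devices()`; the wrapper function `visible_tensorflow_devices` is a hypothetical name:

```python
def visible_tensorflow_devices():
    """Return the device names TensorFlow can see, or None if it cannot be imported."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return [device.name for device in device_lib.list_local_devices()]


devices = visible_tensorflow_devices()
if devices is None:
    print("TensorFlow is not installed in this environment")
else:
    print(devices)
    # Names look like "/device:CPU:0" or "/device:GPU:0".
    print("GPU visible:", any("GPU" in name for name in devices))
```

If no GPU device is listed here, the problem lies between TensorFlow and the driver or libraries rather than in the DeepSpeech training flags.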
