Cudnn Error Faced

Python version : 3.6
other libs versions : as DeepSpeech documented

I used DeepSpeech 0.7.3 to train my model on nearly 1000 hours of data. Training starts fine, but when the first epoch ends and validation begins, I get the error pasted below.
I then upgraded to 0.9.3, and the same thing happened.

I checked that CUDA and cuDNN are installed properly, and the versions match what DeepSpeech documents.

I’ve tested both venv and conda. No improvement; still the same error.

I added

import os
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

to the top of DeepSpeech.py, and I am still facing the same error.

And the error is :

Traceback (most recent call last):
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1365, in _do_call
return fn(*args)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1350, in _run_fn
target_list, run_metadata)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 1024, 1024, 1, 78, 2, 1024]
[[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
(1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 1024, 1024, 1, 78, 2, 1024]
[[{{node tower_0/cudnn_lstm/CudnnRNNV3}}]]
[[tower_0/raw_logits/_193]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “DeepSpeech.py”, line 14, in <module>
ds_train.run_script()
File “/home/shenasa/masoud_parpanchi/DeepSpeech/training/deepspeech_training/train.py”, line 982, in run_script
absl.app.run(main)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/absl/app.py”, line 303, in run
_run_main(main, args)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/absl/app.py”, line 251, in _run_main
sys.exit(main(argv))
File “/home/shenasa/masoud_parpanchi/DeepSpeech/training/deepspeech_training/train.py”, line 954, in main
train()
File “/home/shenasa/masoud_parpanchi/DeepSpeech/training/deepspeech_training/train.py”, line 617, in train
set_loss, steps = run_set(‘dev’, epoch, init_op, dataset=source)
File “/home/shenasa/masoud_parpanchi/DeepSpeech/training/deepspeech_training/train.py”, line 572, in run_set
feed_dict=feed_dict)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 956, in run
run_metadata_ptr)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1180, in _run
feed_dict_tensor, options, run_metadata)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1359, in _do_run
run_metadata)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/client/session.py”, line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 1024, 1024, 1, 78, 2, 1024]
[[node tower_0/cudnn_lstm/CudnnRNNV3 (defined at /home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 1024, 1024, 1, 78, 2, 1024]
[[node tower_0/cudnn_lstm/CudnnRNNV3 (defined at /home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[tower_0/raw_logits/_193]]
0 successful operations.
0 derived errors ignored.

Original stack trace for ‘tower_0/cudnn_lstm/CudnnRNNV3’:
File “DeepSpeech.py”, line 14, in <module>
ds_train.run_script()
File “/home/shenasa/masoud_parpanchi/DeepSpeech/training/deepspeech_training/train.py”, line 982, in run_script
absl.app.run(main)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/absl/app.py”, line 303, in run
_run_main(main, args)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/absl/app.py”, line 251, in _run_main
sys.exit(main(argv))
File “/home/shenasa/masoud_parpanchi/DeepSpeech/training/deepspeech_training/train.py”, line 954, in main
train()
File “/home/shenasa/masoud_parpanchi/DeepSpeech/training/deepspeech_training/train.py”, line 484, in train
gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
File “/home/shenasa/masoud_parpanchi/DeepSpeech/training/deepspeech_training/train.py”, line 317, in get_tower_results
avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
File “/home/shenasa/masoud_parpanchi/DeepSpeech/training/deepspeech_training/train.py”, line 244, in calculate_mean_edit_distance_and_loss
logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
File “/home/shenasa/masoud_parpanchi/DeepSpeech/training/deepspeech_training/train.py”, line 195, in create_model
output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
File “/home/shenasa/masoud_parpanchi/DeepSpeech/training/deepspeech_training/train.py”, line 133, in rnn_impl_cudnn_rnn
sequence_lengths=seq_length)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py”, line 548, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py”, line 854, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py”, line 234, in wrapper
return converted_call(f, options, args, kwargs)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py”, line 439, in converted_call
return _call_unconverted(f, args, kwargs, options)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py”, line 330, in _call_unconverted
return f(*args, **kwargs)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py”, line 440, in call
training)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py”, line 518, in _forward
seed=self._seed)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py”, line 1132, in _cudnn_rnn
outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py”, line 2051, in cudnn_rnnv3
time_major=time_major, name=name)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py”, line 794, in _apply_op_helper
op_def=op_def)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py”, line 3357, in create_op
attrs, op_def, compute_device)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py”, line 3426, in _create_op_internal
op_def=op_def)
File “/home/shenasa/anaconda3/envs/conda_deep_speech/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py”, line 1748, in __init__
self._traceback = tf_stack.extract_stack()

Check your dev data: one or more files probably can’t be read. Look here for some scripts on how to check that. Another possible cause is a mismatch between a wav and its transcript, for example a really short wav paired with a long transcript. But check that the files are readable first.
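
As a rough illustration of both checks, here is a minimal sketch using only the Python standard library. It assumes the usual DeepSpeech CSV layout with `wav_filename` and `transcript` columns, paths that resolve from the current directory, and one feature frame per 20 ms of audio; the function names and the frame-count heuristic are made up for this example and are not the scripts linked above.

```python
import csv
import wave

# Assumption: one feature frame per 20 ms of audio. Adjust if your setup differs.
FRAME_STEP_SECONDS = 0.02


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file; raises if it cannot be read."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / float(wav.getframerate())


def find_suspect_rows(csv_path):
    """Yield (wav_filename, reason) for rows that are unreadable or too short."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            path = row["wav_filename"]
            transcript = row["transcript"]
            try:
                duration = wav_duration_seconds(path)
            except (OSError, EOFError, ZeroDivisionError, wave.Error) as error:
                yield path, "unreadable: {}".format(error)
                continue
            # Heuristic: flag clips with fewer frames than transcript characters.
            frames = int(duration / FRAME_STEP_SECONDS)
            if frames < len(transcript):
                yield path, "too short: {} frames for {} characters".format(
                    frames, len(transcript))
```

Running `for path, reason in find_suspect_rows("dev.csv"): print(path, reason)` (with `dev.csv` standing in for your own dev set) lists candidates to remove or fix before retrying.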

This could be a TensorFlow bug that was fixed in 1.15.4, though some people report that the fix is not always complete. Hard to know for sure.

Try setting TF_CUDNN_RESET_RND_GEN_STATE=1 as an environment variable. Upstream issue: https://github.com/tensorflow/tensorflow/issues/41630
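
If exporting the variable in the shell is inconvenient, one option is to set it from Python, in the same way as the `TF_FORCE_GPU_ALLOW_GROWTH` snippet earlier in the thread. This is only a sketch: the helper name is hypothetical, and it assumes the call happens at the very top of DeepSpeech.py, before anything imports TensorFlow.

```python
import os


def configure_tf_gpu_env():
    """Set the two GPU-related variables discussed in this thread."""
    # Let TensorFlow grow GPU memory on demand instead of reserving it all.
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    # Workaround suggested in the upstream TensorFlow issue linked above.
    os.environ["TF_CUDNN_RESET_RND_GEN_STATE"] = "1"


configure_tf_gpu_env()
# Import tensorflow (or the DeepSpeech training module) only after this point.
```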

Hello lissyx,

I hit the same problem. The error output is exactly the same.

P.S. Python version: 3.6, CUDA version: 10.0, cuDNN version: 7.6. My GPU is a GTX 1660, so I set a very small batch size.

I tried adding ‘TF_CUDNN_RESET_RND_GEN_STATE=1’ as you said and ‘TF_FORCE_GPU_ALLOW_GROWTH=true’ as the documentation says, but neither worked.

By the way, the previous error I got when running DeepSpeech.py was ‘return a non-zero…’. I haven’t fixed that one yet, and now this error shows up.

How can I solve this problem?

Sincerely

Hello

As I told you in private chat (I’ll say it here too, because others may see this):

I solved this issue, but I can’t remember exactly what I did.

You can reduce the batch size (set the validation and test batch sizes to 1, and make the train batch size as large as you can).

Check all your files (maybe they cause the problem).

Use TF GPU memory growth (TF_FORCE_GPU_ALLOW_GROWTH).

Use a conda env.

One of these solutions may help you.
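
For the batch size suggestion, a launch command could be assembled roughly as below. This is a sketch, not a recommended configuration: the CSV paths and the train batch size of 8 are placeholders, and the helper name is hypothetical. The batch size flags are the ones DeepSpeech's training script accepts.

```python
import sys


def build_training_command(train_csv, dev_csv, test_csv, train_batch_size):
    """Build a DeepSpeech.py command with dev and test batch sizes fixed to 1."""
    return [
        sys.executable, "DeepSpeech.py",
        "--train_files", train_csv,
        "--dev_files", dev_csv,
        "--test_files", test_csv,
        "--train_batch_size", str(train_batch_size),
        "--dev_batch_size", "1",
        "--test_batch_size", "1",
    ]


# Placeholder paths; replace them with your own CSV files.
command = build_training_command("train.csv", "dev.csv", "test.csv", 8)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the DeepSpeech checkout directory.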

Please avoid conda. We have had spurious reports of issues related to it, so let’s not add more to this here.