Hello,
I am getting the following error while training the model:
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:01:18 | Steps: 50 | Loss: 157.871378
Traceback (most recent call last):
File "/home/fali/projects/deepspeech-train-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/fali/projects/deepspeech-train-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/fali/projects/deepspeech-train-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 168, 300, 2048]
[[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
[[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_69]]
(1) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 168, 300, 2048]
[[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
0 successful operations.
0 derived errors ignored.
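To make the failing configuration easier to read, here is a small sketch that simply pairs the field names with the values printed in the error. Nothing in it comes from DeepSpeech or TensorFlow; `names`, `values` and `config` are just illustrative variables holding what the message already says.

```python
# Field names and values copied verbatim from the
# "Failed to call ThenRnnBackward" message above.
names = ["num_layers", "input_size", "num_units", "dir_count",
         "max_seq_length", "batch_size", "cell_num_units"]
values = [1, 2048, 2048, 1, 168, 300, 2048]

# Pair each name with its value so the config reads as key/value lines.
config = dict(zip(names, values))
for name, value in config.items():
    print(f"{name:>15} = {value}")
```

Read this way, the failing backward pass has a `batch_size` of 300, which matches the `--train_batch_size 300` flag I pass, and a `max_seq_length` of 168.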
I am running the model with the following parameters:
python DeepSpeech.py \
--inter_op_parallelism_threads 4 \
--train_files speech_data/clips/train.csv \
--test_files speech_data/clips/test.csv \
--train_cudnn \
--summary_dir tensorboard_summary_$ejecution \
--checkpoint_dir checkpoint_$ejecution \
--export_dir model_out_$ejecution \
--epochs 30 \
--train_batch_size 300 \
--test_batch_size 100 \
--learning_rate 0.001
I am using:
cuda-10.0
libcudnn 7.4.2
And I am working from the following version:
branch master
revision -> 080dc7df
Thanks for your help.