Problem training model

Hello,
I am getting the following error while training the model:

I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:01:18 | Steps: 50 | Loss: 157.871378
Traceback (most recent call last):
  File "/home/fali/projects/deepspeech-train-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/fali/projects/deepspeech-train-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/fali/projects/deepspeech-train-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 168, 300, 2048] 
	 [[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
	 [[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_69]]
  (1) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 168, 300, 2048] 
	 [[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
0 successful operations.
0 derived errors ignored.

I am running the model with the following parameters:

python DeepSpeech.py \
--inter_op_parallelism_threads 4 \
--train_files speech_data/clips/train.csv \
--test_files speech_data/clips/test.csv \
--train_cudnn  \
--summary_dir tensorboard_summary_$ejecution \
--checkpoint_dir checkpoint_$ejecution \
--export_dir model_out_$ejecution \
--epochs 30 \
--train_batch_size 300 \
--test_batch_size 100 \
--learning_rate 0.001 

I am using:
cuda-10.0
libcudnn 7.4.2

And I am working in version:
branch master
revision -> 080dc7df

Thanks for your help.

Try cuDNN 7.6.

I have tried cuDNN 7.6.0 and 7.6.5, but unfortunately the same problem still happens.

Try reducing the batch size?

And set a dev batch size.

I would recommend running everything with a batch size of 16 or 32 for 1 epoch to see if everything is running smoothly. If you mix datasets you usually get some strange errors the first couple of times. Then use a higher batch size for train and dev.
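
For instance, something like this (a sketch based on your command above; speech_data/clips/dev.csv is a guess, since your command has no --dev_files yet):

# Smoke test: small batches, one epoch, throwaway checkpoint dir
python DeepSpeech.py \
--train_files speech_data/clips/train.csv \
--dev_files speech_data/clips/dev.csv \
--test_files speech_data/clips/test.csv \
--train_cudnn \
--checkpoint_dir checkpoint_smoke_test \
--epochs 1 \
--train_batch_size 16 \
--dev_batch_size 16 \
--test_batch_size 16 \
--learning_rate 0.001

If that one epoch runs through cleanly, raise the train and dev batch sizes step by step until you approach your GPU memory limit.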

I have a problem with the training setup and it is making me frustrated…

I don't think I need to give a full specification, because the training does run.

Last base command:
XLA_PYTHON_CLIENT_ALLOCATOR=platform TF_XLA_FLAGS=--tf_xla_cpu_global_jit python3 -u /home/bram/Documents/coding/speech/deepspeech/DeepSpeech.py \
--train_files "/home/bram/Documents/coding/speech/traindata/CVdata/clips/train.csv" \
--dev_files "/home/bram/Documents/coding/speech/traindata/CVdata/clips/dev.csv" \
--test_files "/home/bram/Documents/coding/speech/traindata/CVdata/clips/test.csv" \
--alphabet_config_path "/home/bram/Documents/speech/traindata/corpus/alphabet.txt" \
--lm_binary_path "/home/bram/Documents/speech/traindata/corpus/lm.binary" \
--lm_trie_path "/home/bram/Documents/speech/traindata/corpus/trie" \
--learning_rate 0.000025 \
--dropout_rate 0.2 \
--log_level 1 \
--epochs 12 \
--export_dir "/home/bram/Documents/speech/checkpoint" \
--checkpoint_dir "/home/bram/Documents/speech/checkpoint" \
--use_allow_growth true \
--train_batch_size 24

What I have already done:
tried many variants of hyperparameters:

  1. learning_rate: 0.000025, 0.0001, 0.00001, 0.0000125, etc.
  2. dropout_rate: 0.2, 0.15, 0.25, 0.4, 0.5
  3. n_hidden: 2048, 1536, 1024, 768
  4. batch size: 4, 8, 12

The problem:

  1. If I choose batch size 12, overfitting starts at epoch 13; with batch size 8, it starts at epoch 9.
  2. The validation result varies a lot depending on the hyperparameters, but it never gets below 60%; my best record is 64%, and after that overfitting starts. Of course, the results are a mess.

Another problem: I can save the checkpoint but cannot create the model. But I think I can handle that later, after the training gets a good result.

Dataset:
Common Voice Bahasa Indonesia.

You probably don’t have enough data to get better results. How many hours do you have?

Just set a different dir for the export_dir to save results. But a .pb file should be in the checkpoint dir.

And did you try a learning rate of 0.001? I know you should be better off with a lower value, but with enough data you should be fine with that.
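
For example (a sketch; /home/bram/Documents/speech/model_export is a made-up directory, the rest is taken from your command above):

# Sketch: keep checkpoints and the exported .pb in separate directories
python3 DeepSpeech.py \
--train_files "/home/bram/Documents/coding/speech/traindata/CVdata/clips/train.csv" \
--dev_files "/home/bram/Documents/coding/speech/traindata/CVdata/clips/dev.csv" \
--test_files "/home/bram/Documents/coding/speech/traindata/CVdata/clips/test.csv" \
--checkpoint_dir "/home/bram/Documents/speech/checkpoint" \
--export_dir "/home/bram/Documents/speech/model_export" \
--learning_rate 0.001 \
--epochs 12 \
--train_batch_size 24

After a successful run, output_graph.pb should land in the export_dir rather than next to the checkpoints.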

Based on https://voice.mozilla.org/en/datasets, Indonesian.

The dataset total is 3 hours.

For learning rate 0.001, I will try it and report soon whether it still overfits or not…
I will also try different dropout rates, starting with the default dropout rate.
I will report soon…

Your results are perfect for 3 hours, you need about 200-300 hours to get somewhat decent results for general language understanding.

Thanks for your quick response…

After training with learning rate 0.001 (trying 48 epochs first), it stopped at epoch 10.
Last train loss: 102.35, last validation loss: 104.40. I guess this is because of early stopping, which stops the training if it detects overfitting during the training process.

[Graph: orange = train loss, blue = validation loss]

I am really frustrated with the hyperparameters and need advice.

By the way, how did you calculate that the time needed to train a 3-hour dataset is about 200-300 hours? With batch size 12, one epoch needs about 04:38 to train and 51 seconds to validate.
So 300 hours would mean about 4,000 epochs.

What is your advice on hyperparameters? Should I use a batch size of 1? Or turn off the early-stop setting?

Sorry, I meant you need 200 hours of input audio. Training times vary widely by the GPU used.

You can change hyperparameters, but it is unlikely your results will change a lot. You need more data.
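
To answer the early-stop question: if you want to rule it out while experimenting, there is an early_stop boolean flag (verify that your revision has it with python3 DeepSpeech.py --helpfull | grep early_stop, since flags change between versions). A sketch:

# Sketch: disable early stopping for one experiment; paths are copied
# from your command above, and the early_stop flag is assumed to exist.
python3 DeepSpeech.py \
--train_files "/home/bram/Documents/coding/speech/traindata/CVdata/clips/train.csv" \
--dev_files "/home/bram/Documents/coding/speech/traindata/CVdata/clips/dev.csv" \
--test_files "/home/bram/Documents/coding/speech/traindata/CVdata/clips/test.csv" \
--checkpoint_dir "/home/bram/Documents/speech/checkpoint" \
--early_stop=false \
--epochs 48 \
--train_batch_size 12

But again: with 3 hours of audio, turning early stopping off will only let the model overfit for longer; it will not improve the validation loss.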

I followed the training instructions from https://tilmankamp.github.io/FOSDEM2018/, which use KenLM.

Meanwhile, the GitHub docs at https://github.com/mozilla/DeepSpeech/blob/master/doc/TRAINING.rst#training-a-model do not mention KenLM at all.

Let's clear this up first… which training method do we mean? I'm afraid they are different methods.

For now, I will take any suggestion or manual for training the speech model… any good references?

For the 0.6 version look here; for the current master, look into both .py files in the data/lm folder and search here for "scorer". The links above are for the old and new versions.
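
Roughly, the current master flow looks like this (just a sketch; the exact flag names can change between revisions, so check each script's --help first, and every path below is a placeholder):

# Build lm.binary and the vocabulary from a text corpus (paths are placeholders)
python3 data/lm/generate_lm.py \
--input_txt vocabulary.txt \
--output_dir lm_out \
--top_k 500000 \
--kenlm_bins /path/to/kenlm/build/bin \
--arpa_order 5 \
--max_arpa_memory "85%" \
--arpa_prune "0|0|1" \
--binary_a_bits 255 \
--binary_q_bits 8 \
--binary_type trie

# Bundle the LM into a scorer package for training and inference
# (alpha/beta here are generic starting values, not tuned ones)
python3 data/lm/generate_package.py \
--alphabet alphabet.txt \
--lm lm_out/lm.binary \
--vocab lm_out/vocab-500000.txt \
--package kenlm.scorer \
--default_alpha 0.75 \
--default_beta 1.85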

Please be polite and don’t spam the forum with your question.