Problem training model

Hello,
I am getting the following error while training the model:

I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:01:18 | Steps: 50 | Loss: 157.871378
Traceback (most recent call last):
  File "/home/fali/projects/deepspeech-train-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/fali/projects/deepspeech-train-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/fali/projects/deepspeech-train-venv/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 168, 300, 2048] 
	 [[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
	 [[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_69]]
  (1) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 168, 300, 2048] 
	 [[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
0 successful operations.
0 derived errors ignored.

I am running the model with the following parameters:

python DeepSpeech.py \
--inter_op_parallelism_threads 4 \
--train_files speech_data/clips/train.csv \
--test_files speech_data/clips/test.csv \
--train_cudnn  \
--summary_dir tensorboard_summary_$ejecution \
--checkpoint_dir checkpoint_$ejecution \
--export_dir model_out_$ejecution \
--epochs 30 \
--train_batch_size 300 \
--test_batch_size 100 \
--learning_rate 0.001 

I am using:
cuda-10.0
libcudnn 7.4.2

And I am working in version:
branch master
revision -> 080dc7df

Thanks for your help.

Try cuDNN 7.6.

I have tried cuDNN 7.6.0 and 7.6.5, but unfortunately the same problem still happens.

Try reducing the batch size?

And set a dev batch size.

I would recommend running everything with a batch size of 16 or 32 for 1 epoch to see if everything is running smoothly. If you mix datasets you usually get some strange errors the first couple of times. Then use a higher batch size for train and dev.
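
For instance, something like this (a sketch based on your command above; speech_data/clips/dev.csv is a guess, since your command has no --dev_files yet):

# Smoke test: small batches, one epoch, throwaway checkpoint dir
python DeepSpeech.py \
--train_files speech_data/clips/train.csv \
--dev_files speech_data/clips/dev.csv \
--test_files speech_data/clips/test.csv \
--train_cudnn \
--checkpoint_dir checkpoint_smoke_test \
--epochs 1 \
--train_batch_size 16 \
--dev_batch_size 16 \
--test_batch_size 16 \
--learning_rate 0.001

If that one epoch runs through cleanly, raise the train and dev batch sizes step by step until you approach your GPU memory limit.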

I have a problem with the training setup and it is making me frustrated…

I don't think I need to give a full specification, because the training does run.

Last base command:
XLA_PYTHON_CLIENT_ALLOCATOR=platform TF_XLA_FLAGS=--tf_xla_cpu_global_jit python3 -u /home/bram/Documents/coding/speech/deepspeech/DeepSpeech.py \
--train_files "/home/bram/Documents/coding/speech/traindata/CVdata/clips/train.csv" \
--dev_files "/home/bram/Documents/coding/speech/traindata/CVdata/clips/dev.csv" \
--test_files "/home/bram/Documents/coding/speech/traindata/CVdata/clips/test.csv" \
--alphabet_config_path "/home/bram/Documents/speech/traindata/corpus/alphabet.txt" \
--lm_binary_path "/home/bram/Documents/speech/traindata/corpus/lm.binary" \
--lm_trie_path "/home/bram/Documents/speech/traindata/corpus/trie" \
--learning_rate 0.000025 \
--dropout_rate 0.2 \
--log_level 1 \
--epochs 12 \
--export_dir "/home/bram/Documents/speech/checkpoint" \
--checkpoint_dir "/home/bram/Documents/speech/checkpoint" \
--use_allow_growth true \
--train_batch_size 24

What I have already done:
tried many variants of hyperparameters:

  1. learning_rate: 0.000025, 0.0001, 0.00001, 0.0000125, etc.
  2. dropout_rate: 0.2, 0.15, 0.25, 0.4, 0.5
  3. n_hidden: 2048, 1536, 1024, 768
  4. batch size: 4, 8, 12

The problem:

  1. If I choose batch size 12, overfitting starts at epoch 13; with batch size 8, it starts at epoch 9.
  2. The validation result varies a lot depending on the hyperparameters, but it never gets below 60%; my best record is 64%, and after that overfitting starts. Of course, the results are a mess.

Another problem: I can save the checkpoint but cannot create the model. But I think I can handle that later, after the training gets a good result.

Dataset:
Common Voice Bahasa Indonesia.

You probably don’t have enough data to get better results. How many hours do you have?

Just set a different dir for the export_dir to save results. But a .pb file should be in the checkpoint dir.

And did you try a learning rate of 0.001? I know you should be better off with a lower value, but with enough data you should be fine with that.
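
For example (a sketch; /home/bram/Documents/speech/model_export is a made-up directory, the rest is taken from your command above):

# Sketch: keep checkpoints and the exported .pb in separate directories
python3 DeepSpeech.py \
--train_files "/home/bram/Documents/coding/speech/traindata/CVdata/clips/train.csv" \
--dev_files "/home/bram/Documents/coding/speech/traindata/CVdata/clips/dev.csv" \
--test_files "/home/bram/Documents/coding/speech/traindata/CVdata/clips/test.csv" \
--checkpoint_dir "/home/bram/Documents/speech/checkpoint" \
--export_dir "/home/bram/Documents/speech/model_export" \
--learning_rate 0.001 \
--epochs 12 \
--train_batch_size 24

After a successful run, output_graph.pb should land in the export_dir rather than next to the checkpoints.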

Based on https://voice.mozilla.org/en/datasets, Indonesian.

The dataset total is 3 hours.

For learning rate 0.001, I will try it and report soon whether it still overfits or not…
I will also try different dropout rates, starting with the default dropout rate.
I will report soon…

Your results are perfect for 3 hours, you need about 200-300 hours to get somewhat decent results for general language understanding.

Thanks for your quick response…

After training with learning rate 0.001 (trying 48 epochs first), it stopped at epoch 10.
Last train loss: 102.35, last validation loss: 104.40. I guess this is because of early stopping, which stops the training if it detects overfitting during the training process.

[Graph: orange = train loss, blue = validation loss]

I am really frustrated with the hyperparameters and need advice.

By the way, how did you calculate that the time needed to train a 3-hour dataset is about 200-300 hours? With batch size 12, one epoch needs about 04:38 to train and 51 seconds to validate.
So 300 hours would mean about 4,000 epochs.

What is your advice on hyperparameters? Should I use a batch size of 1? Or turn off the early-stop setting?

Sorry, I meant you need 200 hours of input audio. Training times vary widely by the GPU used.

You can change hyperparameters, but it is unlikely your results will change a lot. You need more data.
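
To answer the early-stop question: if you want to rule it out while experimenting, there is an early_stop boolean flag (verify that your revision has it with python3 DeepSpeech.py --helpfull | grep early_stop, since flags change between versions). A sketch:

# Sketch: disable early stopping for one experiment; paths are copied
# from your command above, and the early_stop flag is assumed to exist.
python3 DeepSpeech.py \
--train_files "/home/bram/Documents/coding/speech/traindata/CVdata/clips/train.csv" \
--dev_files "/home/bram/Documents/coding/speech/traindata/CVdata/clips/dev.csv" \
--test_files "/home/bram/Documents/coding/speech/traindata/CVdata/clips/test.csv" \
--checkpoint_dir "/home/bram/Documents/speech/checkpoint" \
--early_stop=false \
--epochs 48 \
--train_batch_size 12

But again: with 3 hours of audio, turning early stopping off will only let the model overfit for longer; it will not improve the validation loss.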

I followed the training instructions from https://tilmankamp.github.io/FOSDEM2018/, which use KenLM.

Meanwhile, the GitHub docs at https://github.com/mozilla/DeepSpeech/blob/master/doc/TRAINING.rst#training-a-model do not mention KenLM at all.

Let's clear this up first… which training method do we mean? I'm afraid they are different methods.

For now, I will take any suggestion or manual for training the speech model… any good references?

For the 0.6 version look here; for the current master, look into both .py files in the data/lm folder and search here for "scorer". The links above are for the old and new versions.
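
Roughly, the current master flow looks like this (just a sketch; the exact flag names can change between revisions, so check each script's --help first, and every path below is a placeholder):

# Build lm.binary and the vocabulary from a text corpus (paths are placeholders)
python3 data/lm/generate_lm.py \
--input_txt vocabulary.txt \
--output_dir lm_out \
--top_k 500000 \
--kenlm_bins /path/to/kenlm/build/bin \
--arpa_order 5 \
--max_arpa_memory "85%" \
--arpa_prune "0|0|1" \
--binary_a_bits 255 \
--binary_q_bits 8 \
--binary_type trie

# Bundle the LM into a scorer package for training and inference
# (alpha/beta here are generic starting values, not tuned ones)
python3 data/lm/generate_package.py \
--alphabet alphabet.txt \
--lm lm_out/lm.binary \
--vocab lm_out/vocab-500000.txt \
--package kenlm.scorer \
--default_alpha 0.75 \
--default_beta 1.85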

Please be polite and don’t spam the forum with your question.