Hi there! New to posting here.
I am trying to train a model on the Spanish Common Voice corpus.
This is the command I use to execute DeepSpeech:
python -u DeepSpeech.py \
--train_files /data/cv_es/train.csv \
--test_files /data/cv_es/test.csv \
--dev_files /data/cv_es/dev.csv \
--train_batch_size 300 \
--dev_batch_size 150 \
--test_batch_size 75 \
--limit_test 1 \
--n_hidden 100 \
--epochs 1 \
--checkpoint_dir /checkpoints \
"$@"
And this is the output I receive:
+ [ ! -f DeepSpeech.py ]
+ export CUDA_VISIBLE_DEVICES=0
+ python -u DeepSpeech.py --train_files /data/cv_es/train.csv --test_files /data/cv_es/test.csv --dev_files /data/cv_es/dev.csv --train_batch_size 300 --dev_batch_size 150 --test_batch_size 75 --limit_test 1 --n_hidden 100 --epochs 1 --checkpoint_dir /checkpoints
I Could not find best validating checkpoint.
I Loading most recent checkpoint from /checkpoints/train-586
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam_1
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
E The following files caused an infinite (or NaN) loss: /data/cv_es/../cv_es/clips/common_voice_es_18365446.wav,/data/cv_es/../cv_es/clips/common_voice_es_18960406.wav,/data/cv_es/../cv_es/clips/common_voice_es_18956393.wav
Epoch 0 | Training | Elapsed Time: 0:00:04 | Steps: 12 | Loss: inf
E The following files caused an infinite (or NaN) loss: /data/cv_es/../cv_es/clips/common_voice_es_19999752.wav
Epoch 0 | Training | Elapsed Time: 0:01:31 | Steps: 293 | Loss: inf
Epoch 0 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 114.047615 | Dataset: /data/cv_es/dev.csv
E The following files caused an infinite (or NaN) loss: /data/cv_es/../cv_es/clips/common_voice_es_19722821.wav
Epoch 0 | Validation | Elapsed Time: 0:00:27 | Steps: 168 | Loss: inf | Dataset: /data/cv_es/dev.csv
--------------------------------------------------------------------------------
I FINISHED optimization in 0:01:59.062929
I Could not find best validating checkpoint.
I Loading most recent checkpoint from /checkpoints/train-879
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
Testing model on /data/cv_es/test.csv
Test epoch | Steps: 7 | Elapsed Time: 0:01:29
Now, the problem is that I don't know whether the flag `--limit_test 1` is doing what I think it does. After reading the docs for that parameter, my understanding is that it limits the number of samples used from the specified dataset.
Quoted below:
# Global Constants
# ================
# Rest of the code omitted
[...]
# Sample limits
f.DEFINE_integer('limit_train', 0, 'maximum number of elements to use from train set - 0 means no limit')
f.DEFINE_integer('limit_dev', 0, 'maximum number of elements to use from validation set - 0 means no limit')
f.DEFINE_integer('limit_test', 0, 'maximum number of elements to use from test set - 0 means no limit')
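For what it's worth, this is my mental model of how such a limit would be applied (just a plain-Python sketch of my reading of the flag description, not the actual DeepSpeech source):

def maybe_limit(samples, limit):
    # My reading of the docs: 0 means 'no limit', and any
    # positive value truncates the dataset to that many samples.
    return samples[:limit] if limit > 0 else samples

test_set = list(range(12600))         # stand-in for my ~12600 test entries
print(len(maybe_limit(test_set, 1)))  # -> 1, so I expect a single test batch
print(len(maybe_limit(test_set, 0)))  # -> 12600, the full set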
My test data file has around 12600 entries. From the output I understand that the model is ignoring that flag and running over the full test dataset. With a limit of 1 and a test_batch_size of 75, shouldn't testing end after just the first step? If so, why is my model's testing already at step 7? That would account for 7 (steps) * 75 (samples per step) = 525 samples processed, right?
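Here is the arithmetic I am doing, in case I am getting something wrong (the 12600 is approximate):

import math

test_entries = 12600  # approximate number of rows in my test.csv
batch_size = 75       # --test_batch_size

# Without any limit, a full test epoch should take:
print(math.ceil(test_entries / batch_size))           # -> 168 steps

# With --limit_test 1, only one sample should remain, so:
print(math.ceil(min(test_entries, 1) / batch_size))   # -> 1 step

# The log already shows 7 steps, i.e. up to 7 * 75 samples,
# well past the single step I expected:
print(7 * batch_size)                                 # -> 525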
I am sorry if I am missing something trivial; I am really new to Machine Learning and AI in general. Maybe someone has some clues and can help.
Thanks in advance to everyone who gets to read this!