Limit_test flag not working as expected

Hi there! New to posting here.

I am trying to train a model on the Spanish Common Voice corpus.

This is the command I use to execute DeepSpeech:

python -u DeepSpeech.py \
  --train_files /data/cv_es/train.csv \
  --test_files /data/cv_es/test.csv \
  --dev_files /data/cv_es/dev.csv \
  --train_batch_size 300 \
  --dev_batch_size 150 \
  --test_batch_size 75 \
  --limit_test 1 \
  --n_hidden 100 \
  --epochs 1 \
  --checkpoint_dir /checkpoints \
  "$@"

And this is the output I receive:

+ [ ! -f DeepSpeech.py ]
+ export CUDA_VISIBLE_DEVICES=0
+ python -u DeepSpeech.py --train_files /data/cv_es/train.csv --test_files /data/cv_es/test.csv --dev_files /data/cv_es/dev.csv --train_batch_size 300 --dev_batch_size 150 --test_batch_size 75 --limit_test 1 --n_hidden 100 --epochs 1 --checkpoint_dir /checkpoints
I Could not find best validating checkpoint.
I Loading most recent checkpoint from /checkpoints/train-586
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam_1
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000                                                                               E The following files caused an infinite (or NaN) loss: /data/cv_es/../cv_es/clips/common_voice_es_18365446.wav,/data/cv_es/../cv_es/clips/common_voice_es_18960406.wav,/data/cv_es/../cv_es/clips/common_voice_es_18956393.wav
Epoch 0 |   Training | Elapsed Time: 0:00:04 | Steps: 12 | Loss: inf                                                                                   E The following files caused an infinite (or NaN) loss: /data/cv_es/../cv_es/clips/common_voice_es_19999752.wav
Epoch 0 |   Training | Elapsed Time: 0:01:31 | Steps: 293 | Loss: inf
Epoch 0 | Validation | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 114.047615 | Dataset: /data/cv_es/dev.csv                                              E The following files caused an infinite (or NaN) loss: /data/cv_es/../cv_es/clips/common_voice_es_19722821.wav
Epoch 0 | Validation | Elapsed Time: 0:00:27 | Steps: 168 | Loss: inf | Dataset: /data/cv_es/dev.csv
--------------------------------------------------------------------------------
I FINISHED optimization in 0:01:59.062929
I Could not find best validating checkpoint.
I Loading most recent checkpoint from /checkpoints/train-879
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
Testing model on /data/cv_es/test.csv
Test epoch | Steps: 7 | Elapsed Time: 0:01:29

Now, the problem is that I don’t know whether the flag `--limit_test 1` is doing what I expect. From the documentation for this parameter, I understood that it limits the number of samples used from the specified dataset.

The relevant flag definitions:

# Global Constants
# ================

# Rest of the code omitted
[...]

# Sample limits

f.DEFINE_integer('limit_train', 0, 'maximum number of elements to use from train set - 0 means no limit')
f.DEFINE_integer('limit_dev', 0, 'maximum number of elements to use from validation set - 0 means no limit')
f.DEFINE_integer('limit_test', 0, 'maximum number of elements to use from test set - 0 means no limit')

My test data file has around 12,600 entries. From the output, I understand that the model is ignoring the flag and using the complete test dataset. Since I have a test_batch_size of 75 and a limit of 1 sample, shouldn’t testing end after the first step? If so, why is my model’s testing already on its 7th step? That would account for 7 (steps) × 75 (samples per step) = 525 samples processed, right?
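To spell out the arithmetic behind my expectation (assuming the limit is applied before batching; the numbers are the ones from my run above):

```python
import math

test_entries = 12600  # approximate number of rows in my test.csv
batch_size = 75       # --test_batch_size
limit_test = 1        # --limit_test

# Without a limit, the full test set should take this many steps:
full_steps = math.ceil(test_entries / batch_size)

# With --limit_test 1, only one sample should remain, so a single step:
limited_steps = math.ceil(min(limit_test, test_entries) / batch_size)

print(full_steps, limited_steps)  # 168 1
```

So 7 steps already means more than the 1 sample I asked for, but also far fewer than the 168 steps a full pass would take.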

I am sorry if I am missing something trivial; I am really new to machine learning and AI in general. Maybe someone has some clues and can help.

Thanks in advance to everyone who reads this!

We can’t help since you did not bother mentioning what version you are working on.

I am sorry I missed that.

I followed the DeepSpeech documentation (v0.8.0).
My CUDA version is 11.0.
I am also using Docker with NVIDIA container support.

This model is being run inside a Docker container created from the template Dockerfile hosted on the DeepSpeech repository.

If this is not what you need, let me know so I can include the missing information.

Thanks @lissyx .

You mention documentation, are you sure the code is v0.8.0 as well?

This is wrong. You should use CUDA 10.0, because of TensorFlow r1.15.

FTR @Tilman_Kamp fixed that for v0.8.0, I don’t know the details of how it should precisely work in your case.

I have just checked; I was running code version ‘0.9.0-alpha.3’.

>> cat /DeepSpeech/VERSION
0.9.0-alpha.3

Oh, I didn’t know that. I will install the proper version and try again.

I will look that up to see if it helps in my case. Moreover, I will specify the DeepSpeech code version when installing it inside the Docker container. I think I just left it as default.

Thank you very much!!

We have deepspeech-train:v0.8.0, just use that?

Hi - I found a passage in the v0.8.0 documentation which suggests otherwise:

https://deepspeech.readthedocs.io/en/v0.8.0/USING.html#cuda-dependency

The GPU capable builds (Python, NodeJS, C++, etc) depend on the same CUDA runtime as upstream TensorFlow. Currently with TensorFlow 2.2 it depends on CUDA 10.1 and CuDNN v7.6. See the TensorFlow documentation.

I would like to check whether the current requirement for v0.8.0 is TensorFlow 1.15 with CUDA 10.0, or TensorFlow 2.2 with CUDA 10.1. Thanks.

I’m sorry, I’m not quite sure I got what you meant by that. Are you referring to the brand new v0.8.0 release?

The current requirement for training is TensorFlow 1.15 and CUDA 10.0.

The current requirement for using GPU-enabled inference packages is CUDA 10.1. Note that you don’t have to install TensorFlow at all if you’re using the inference packages.

As of today, after the latest v0.8.0 release, the setup.py file (https://github.com/mozilla/DeepSpeech/blob/v0.8.0/setup.py) shows at line 77 that it depends on TensorFlow v1.15.2.

The TensorFlow documentation (https://www.tensorflow.org/install/source#tested_build_configurations) indicates that this version (though an older patch release) has been tested with the following software versions:

  • Tensorflow: tensorflow_gpu-1.15.0
  • Python: 2.7, 3.3-3.7
  • GCC: 7.3.1
  • Bazel: 0.26.1
  • cuDNN: 7.4
  • CUDA: 10.0

But I don’t know whether 1.15.2 would still work with this setup; I guess we’ll need to try. I haven’t found anything else.
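As a rough sanity check (my own heuristic, not anything stated in the TensorFlow docs): CUDA and cuDNN requirements normally stay the same across patch releases of one minor line, so comparing only the major.minor components suggests 1.15.2 should match the tested 1.15.0 configuration:

```python
def same_minor_line(a, b):
    """True if two version strings share the same major.minor components,
    e.g. 1.15.2 and 1.15.0 both belong to the 1.15 line."""
    return a.split(".")[:2] == b.split(".")[:2]

print(same_minor_line("1.15.2", "1.15.0"))  # True: expect the same CUDA 10.0 requirement
print(same_minor_line("2.2.0", "1.15.0"))   # False: different line, requirements may differ
```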