Running DeepSpeech 0.7.4 on the Google Speech Commands Dataset

Hello,

I have been running DeepSpeech 0.7.4 on the Google Speech Commands dataset for a few hours now. It is about 65k WAV files, each a one-second utterance. Link to the data here

I have an older GPU currently (Nvidia Quadro P690).

I have training up and running, but I had a few questions about some of the script options for DeepSpeech.py.

The --checkpoint_dir option doesn’t seem to save any of my model checkpoints in that location. It always defaults to ~/.local/share/deepspeech/checkpoints for some reason. I am not sure if --export_dir will have issues as well when the model has completed training. Are there defaults for where these are saved?

Also, I have been at Epoch 0 for quite some time even though I specified --epochs 1. Is Epoch 0 just the notation for the “first epoch”?

Given that these are 65k one-second utterances, is it strange that it is taking hours to train a single epoch? Are there any issues in my bash script that might cause it to train indefinitely? I just want to get a gauge of how long this will take. I am using the default batch size of 1, which I am sure is part of the reason. I am at about 26k steps after 3 hours and 36 minutes. I assume a single step equals one batch put through the entire network once, correct?
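A rough back-of-the-envelope, assuming a step really is one batch (so one utterance at batch size 1): 26k steps in 3 hours 36 minutes is roughly 2 steps per second, so a full pass over the training split (some large fraction of those 65k clips) would land somewhere around 7-9 hours per epoch at this batch size.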

Thank you for any help. My bash script is below:

#!/bin/sh

set -xe
if [ ! -f DeepSpeech.py ]; then
    echo "Please make sure you run this from DeepSpeech's top level directory."
    exit 1
fi;


export NVIDIA_VISIBLE_DEVICES=0
export CUDA_VISIBLE_DEVICES=0
export TF_FORCE_GPU_ALLOW_GROWTH=true

python -u DeepSpeech.py \
  --train_files //tmp/external/google_cmds_csvs/train.csv \
  --test_files  //tmp/external/google_cmds_csvs/test.csv \
  --dev_files  //tmp/external/google_cmds_csvs/dev.csv \
 # --alphabet_config_path //DeepSpeech/data/alphabet.txt \
 # --scorer_path //tmp/external/deepspeech-0.7.1-models.scorer #  \
 # --n_hidden 100 \
  --epochs 1 \
  --train_batch_size 1 \
  --dev_batch_size 1 \
  --test_batch_size 1 \
  --export_dir //tmp/external/deepspeech_models/googlecommands/ \
  --checkpoint_dir //tmp/external/model_chkpts \  
 # "$@"

There are defaults, don’t worry; just check whether you used the right parameter names. You can create a model from a checkpoint if all else fails.

Yes, you are right; whether something starts at 0 or 1 will be the death of me one day :)

No, it depends on the hardware. Use higher batch sizes.

Yes, use a train batch size of 4, 8, or as high as you can go without getting an OutOfMemoryError.

And set dropout to something like 0.25 to 0.4, depending on the data.
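In your script that would look something like this (the dropout flag is --dropout_rate if I remember flags.py correctly; push the batch sizes as high as your 2 GB card tolerates):

  --train_batch_size 8 \
  --dev_batch_size 8 \
  --test_batch_size 8 \
  --dropout_rate 0.25 \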

Great. Thank you @othiele. I figured things were just taking some time due to the low batch size. I am used to training CNNs on my own hardware, so I am still trying to set my expectations for RNNs on a weaker GPU.

Is there any rhyme or reason to why the --checkpoint_dir argument doesn’t write checkpoints to that location? I know the directory exists; I am just curious why it decides to go to ~/.local instead.

Thanks again, Olaf. That cleared up the majority of my issues.

Maybe because of the backslash at the end, so the shell doesn’t recognize it as part of the command?

And for future reference, the defaults are in flags.py.
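If you'd rather not dig through the source, the training entry point goes through absl, so its generic flag help should also dump every flag with its default (I haven't re-checked this on 0.7.4, so treat it as a hint):

python -u DeepSpeech.py --helpfull | less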

Perfect. Thanks! I can’t believe I missed that trailing backslash. I’ll fix that for my next training run and see if that resolves it.
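For anyone else who hits this: in POSIX shell a backslash only continues a line when it is the very last character (no trailing spaces), and a commented-out line in the middle of a continued command terminates that command early, so any flags after it are silently dropped. A toy illustration (not my actual script):

# Works: the backslash is the last character, so both words reach one echo.
echo one \
  two

# Broken: the comment line ends the command after "one";
# "three" then runs as a separate command and fails.
echo one \
  # two \
  three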

Hi again @othiele. How does one go about creating a model from checkpoint files? I want to take the checkpoint files I have and create a .pb file from them, so I can convert it to .pbmm and then try some benchmarking with the deepspeech client.

It’s in the training documentation, export section.

My model was stopped before it finished training. There is nothing in the --export_dir path I specified. I just have a folder of summaries and checkpoints.

I am not seeing how this documentation can assist me. Maybe I am looking in the wrong section?

Checkpointing

During training of a model, so-called checkpoints will get stored on disk. This takes place at a configurable time interval. The purpose of checkpoints is to allow interruption (also in the case of some unexpected failure) and later continuation of training without losing hours of training time. Resuming from checkpoints happens automatically by just (re)starting training with the same --checkpoint_dir of the former run. Alternatively, you can specify more fine-grained options with --load_checkpoint_dir and --save_checkpoint_dir, which specify separate locations to use for loading and saving checkpoints respectively. If not specified, these flags use the same value as --checkpoint_dir, i.e. load from and save to the same directory.

Be aware, however, that checkpoints are only valid for the same model geometry they had been generated from. In other words: if there are error messages about certain Tensors having incompatible dimensions, this is most likely due to an incompatible model change. One usual way out would be to wipe all checkpoint files in the checkpoint directory, or to change it before starting the training.

Exporting a model for inference

If the --export_dir parameter is provided, a model will have been exported to this directory during training. Refer to the usage instructions for information on running a client that can use the exported model.

Exporting a model for TFLite

If you want to experiment with the TF Lite engine, you need to export a model that is compatible with it, then use the --export_tflite flags. If you already have a trained model, you can re-export it for TFLite by running DeepSpeech.py again and specifying the same checkpoint_dir that you used for training, as well as passing --export_tflite --export_dir /model/export/destination.

How about the two “exporting” parts? That’s what you want.

Then you need to share logs if you need help figuring out why it stopped, or finish the training.

I stopped it on purpose with Ctrl+C because my .sh script was apparently incorrect (I have it posted above in this thread) and it was training for more than the --epochs 1 I specified. I had to stop the training or else I’d be waiting another 300 hours for it to finish, since the default number of epochs is 75. I only wanted to train for one epoch and it went over that, so I force-stopped it.

I am inclined to believe that, since my .sh script had issues, it didn’t recognize my --export_dir argument and thus never exported anything.

Why? You stopped training manually in the middle; we perform the export after training, so it’s normal that there has been no export.

It trained through two complete epochs, but since each epoch took six and a half hours, I needed to cancel it; it would have kept training because the --epochs 1 argument was not stopping it at one.

So there is no way to recover a .pb from the checkpoints? I need to train again? It is not the end of the world, but I just want to assess my options.

However, I am now getting OOM errors after trying to train from scratch with cleared checkpoints:

I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Traceback (most recent call last):
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [1,19,26,494] and type float
	 [[{{node tower_0/conv1d/ExpandDims_1}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 955, in run_script
    absl.app.run(main)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 927, in main
    train()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 595, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 560, in run_set
    feed_dict=feed_dict)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [1,19,26,494] and type float
	 [[node tower_0/conv1d/ExpandDims_1 (defined at /DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'tower_0/conv1d/ExpandDims_1':
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 955, in run_script
    absl.app.run(main)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 927, in main
    train()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 473, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 312, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 239, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 167, in create_model
    batch_x = create_overlapping_windows(batch_x)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 69, in create_overlapping_windows
    batch_x = tf.nn.conv1d(input=batch_x, filters=eye_filter, stride=1, padding='SAME')
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 574, in new_func
    return func(*args, **kwargs)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 574, in new_func
    return func(*args, **kwargs)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/nn_ops.py", line 1672, in conv1d
    filters = array_ops.expand_dims(filters, 0)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/array_ops.py", line 265, in expand_dims
    return expand_dims_v2(input, axis, name)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/array_ops.py", line 314, in expand_dims_v2
    return gen_array_ops.expand_dims(input, axis, name)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_array_ops.py", line 2465, in expand_dims
    "ExpandDims", input=input, dim=axis, name=name)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

There is; you need to set --epochs to 0 or 1, and/or ensure you pass --load to load the proper checkpoint.

But when you Ctrl+C’d, you completely interrupted the flow.

No idea why, and hard to debug without more context (command line, hardware, …)

Never mind, you shared those at the beginning.

I can’t find any documentation for an Nvidia Quadro P690; the only one I found is the P620, and that indeed seems old.

My best guess is that you have a dangling Python process still somehow attached to the GPU. Try to pgrep / pkill them (nvidia-smi might not show them; it happened to me a few days ago after a brutal Ctrl+C).
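Something like this should find them (a sketch; adjust the pattern to whatever your container runs):

pgrep -af python          # list python processes with their full command lines
pkill -f DeepSpeech.py    # kill anything still running the training script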

So I have reworked my Deepspeech_train.sh file as of today to fix the issues I discussed earlier. Yesterday it was able to train for multiple epochs (before I had to terminate it). My new .sh script is below, but I really didn’t change anything major that would impact memory usage, so I am curious why I am OOMing now. I’m keeping the batch sizes at 1. I also cleared the checkpoints folder so it starts fresh.

I imagine it is hardware related. The GPU is at 0-1% utilization in Windows Task Manager (I am training through WSL and Docker with my GPUs enabled). When DeepSpeech starts, utilization jumps up to about 10-11%, but it also nearly maxes out the dedicated VRAM (it’s a tiny card with only 2 GB). I am still perplexed as to why it worked well yesterday but is a no-go today. I attached an image of what the GPU is doing while nothing is running.

#!/bin/sh

set -xe
if [ ! -f DeepSpeech.py ]; then
    echo "Please make sure you run this from DeepSpeech's top level directory."
    exit 1
fi;

export NVIDIA_VISIBLE_DEVICES=0
export CUDA_VISIBLE_DEVICES=0
export TF_FORCE_GPU_ALLOW_GROWTH=true

python -u DeepSpeech.py \
  --train_files //tmp/external/google_cmds_csvs/train.csv \
  --test_files //tmp/external/google_cmds_csvs/test.csv \
  --dev_files //tmp/external/google_cmds_csvs/dev.csv \
  --epochs 1 \
  --train_batch_size 1 \
  --dev_batch_size 1 \
  --test_batch_size 1 \
  --export_dir //tmp/external/deepspeech_models/googlecommands/ \
  --checkpoint_dir //tmp/external/model_chkpts/ \
  "$@"

--alphabet_config_path //DeepSpeech/data/alphabet.txt \
--scorer_path //tmp/external/deepspeech-0.7.1-models.scorer # \
--n_hidden 100 \

(screenshot: GPU utilization in Windows Task Manager while idle)

Have you checked what I said? If it worked just a few minutes ago, that’s likely the reason.

Please note that this is not something we support.

Again, you brutally Ctrl+C’d; please pgrep for hanging processes. Also, GPU resource handling in your WSL context might be different.


GPU usage will improve with batch size. Reported memory usage might not reflect actual consumption, because TensorFlow allocates everything by default.

Not sure this is still working, but we do have --use_allow_growth True that you can try.
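i.e. add something like this to the flag list, in addition to the TF_FORCE_GPU_ALLOW_GROWTH=true you already export in your script (flag spelling from memory, so double-check it against flags.py):

  --use_allow_growth=true \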

Thanks to your advice, I just shut down my container and restarted it, and now it is training. There was probably something lingering in the container that was eating up my GPU.


There is; you need to set --epochs to 0 or 1, and/or ensure you pass --load to load the proper checkpoint.

So I used --load_checkpoint_dir on that checkpoints folder and set --epochs 0. It is now running the checkpoint on my test data via a “test epoch”. When this finishes, will I have a completed model in my --export_dir for inference?
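For the record, the invocation ended up roughly like this (same paths as my script above; a sketch from memory, not a verbatim copy):

python -u DeepSpeech.py \
  --load_checkpoint_dir //tmp/external/model_chkpts/ \
  --epochs 0 \
  --test_files //tmp/external/google_cmds_csvs/test.csv \
  --test_batch_size 1 \
  --export_dir //tmp/external/deepspeech_models/googlecommands/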

Aside: thanks again for all the help, @lissyx. I know you have to answer lots of silly questions on this forum, but I cannot thank you enough for the quick help you have given in solving many of my issues while getting DeepSpeech running. Your assistance is extremely helpful.
