Running DeepSpeech 0.7.4 on the Google Commands Dataset

Perfect. Thanks! I can’t believe I missed that ending backslash. I’ll have to fix that in my next training and see if that fixes it.

Hi again @othiele . How does one go about creating a model from checkpoint files? I want to take the checkpoint files I have, and create a .pb file with it so I can convert to .pbmm and then try some benchmarking with the deepspeech client.

It’s in the training documentation, export section.

My model was stopped before it finished training. There is nothing in the --export_dir path I specified. I just have a folder of summaries and checkpoints.

I am not seeing how this documentation can assist me. Maybe I am looking in the wrong section?

Checkpointing

During training of a model, so-called checkpoints will get stored on disk. This takes place at a configurable time interval. The purpose of checkpoints is to allow interruption (also in the case of some unexpected failure) and later continuation of training without losing hours of training time. Resuming from checkpoints happens automatically by just (re)starting training with the same --checkpoint_dir as the former run. Alternatively, you can specify more fine-grained options with --load_checkpoint_dir and --save_checkpoint_dir, which specify separate locations to use for loading and saving checkpoints respectively. If not specified, these flags use the same value as --checkpoint_dir, i.e. load from and save to the same directory.
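
For illustration only, resuming might look roughly like this; the paths and the epoch count here are placeholders, not taken from this thread:

    # Hypothetical sketch: resume training by pointing at the same checkpoint directory.
    python -u DeepSpeech.py \
        --train_files /data/train.csv \
        --dev_files /data/dev.csv \
        --test_files /data/test.csv \
        --checkpoint_dir /data/checkpoints/ \
        --epochs 5
    # Or, with separate load/save locations:
    #     --load_checkpoint_dir /data/old_checkpoints/ \
    #     --save_checkpoint_dir /data/new_checkpoints/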

Be aware, however, that checkpoints are only valid for the same model geometry they were generated from. In other words: if there are error messages about certain Tensors having incompatible dimensions, this is most likely due to an incompatible model change. The usual way out is to wipe all checkpoint files in the checkpoint directory, or to change the directory, before starting the training.

Exporting a model for inference

If the --export_dir parameter is provided, a model will be exported to this directory at the end of training. Refer to the usage instructions for information on running a client that can use the exported model.

Exporting a model for TFLite

If you want to experiment with the TF Lite engine, you need to export a model that is compatible with it, using the --export_tflite flag. If you already have a trained model, you can re-export it for TFLite by running DeepSpeech.py again, specifying the same --checkpoint_dir that you used for training, and passing --export_tflite --export_dir /model/export/destination.
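
To make the two exporting paragraphs above concrete, here is a hedged sketch of what the commands could look like; the paths are placeholders, so adjust them to your own checkpoint and export directories:

    # Hypothetical sketch: export a .pb from existing checkpoints.
    python -u DeepSpeech.py \
        --checkpoint_dir /data/checkpoints/ \
        --export_dir /data/export/

    # Same idea, but producing a TFLite model instead:
    python -u DeepSpeech.py \
        --checkpoint_dir /data/checkpoints/ \
        --export_tflite \
        --export_dir /data/export/

    # If I recall the docs correctly, the exported output_graph.pb can then be turned
    # into a .pbmm with the convert_graphdef_memmapped_format tool from the native client.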

How about the two “exporting” parts? That’s what you want.

Then you need to share logs if you want help understanding why it stopped, or just finish the training.

I stopped it on purpose with ctrl+c because my .sh script was apparently incorrect (I have it posted above in this thread) and it was training for more than the --epochs 1 argument I specified. I had to stop the training or else I’d be waiting another 300 hours for it to finish since the default epochs is 75. I only wanted to train it for one epoch and it went over that, thus I force stopped it.

I am inclined to believe that since my .sh script had issues that it didn’t recognize my --export_dir argument and thus never exported anything.

Why? You stopped training manually in the middle; we perform the export after the training, so it's normal that there has been no export.

It trained through 2 complete epochs, but at 6 and a half hours per epoch I needed to cancel it, since the --epochs 1 argument was not stopping it at 1 and it would have kept training.

So there is no way to recover a .pb from the checkpoints? I need to train again? It is not the end of the world but I just want to assess.

However, I am now getting OOM errors after trying to train fresh with cleared checkpoints:

I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Traceback (most recent call last):
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [1,19,26,494] and type float
[[{{node tower_0/conv1d/ExpandDims_1}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/DeepSpeech/training/deepspeech_training/train.py", line 955, in run_script
absl.app.run(main)
File "/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/DeepSpeech/training/deepspeech_training/train.py", line 927, in main
train()
File "/DeepSpeech/training/deepspeech_training/train.py", line 595, in train
train_loss, _ = run_set('train', epoch, train_init_op)
File "/DeepSpeech/training/deepspeech_training/train.py", line 560, in run_set
feed_dict=feed_dict)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [1,19,26,494] and type float
[[node tower_0/conv1d/ExpandDims_1 (defined at /DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'tower_0/conv1d/ExpandDims_1':
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/DeepSpeech/training/deepspeech_training/train.py", line 955, in run_script
absl.app.run(main)
File "/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/DeepSpeech/training/deepspeech_training/train.py", line 927, in main
train()
File "/DeepSpeech/training/deepspeech_training/train.py", line 473, in train
gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
File "/DeepSpeech/training/deepspeech_training/train.py", line 312, in get_tower_results
avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
File "/DeepSpeech/training/deepspeech_training/train.py", line 239, in calculate_mean_edit_distance_and_loss
logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
File "/DeepSpeech/training/deepspeech_training/train.py", line 167, in create_model
batch_x = create_overlapping_windows(batch_x)
File "/DeepSpeech/training/deepspeech_training/train.py", line 69, in create_overlapping_windows
batch_x = tf.nn.conv1d(input=batch_x, filters=eye_filter, stride=1, padding='SAME')
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 574, in new_func
return func(*args, **kwargs)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 574, in new_func
return func(*args, **kwargs)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/nn_ops.py", line 1672, in conv1d
filters = array_ops.expand_dims(filters, 0)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/array_ops.py", line 265, in expand_dims
return expand_dims_v2(input, axis, name)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/array_ops.py", line 314, in expand_dims_v2
return gen_array_ops.expand_dims(input, axis, name)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_array_ops.py", line 2465, in expand_dims
"ExpandDims", input=input, dim=axis, name=name)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()

There is: you need to set the epochs to 0 or 1, and/or ensure you pass --load so that the proper checkpoint is loaded.
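
Concretely, a rough sketch of what such a recovery run could look like, reusing the directories from the script posted later in this thread; the exact flag combination is an assumption, not a verified recipe:

    # Hypothetical recovery run: load the existing checkpoints, train for 0 epochs, and export.
    python -u DeepSpeech.py \
        --load_checkpoint_dir //tmp/external/model_chkpts/ \
        --epochs 0 \
        --test_files //tmp/external/google_cmds_csvs/test.csv \
        --test_batch_size 1 \
        --export_dir //tmp/external/deepspeech_models/googlecommands/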

But when you CTRL+C'd, you completely interrupted the flow.

No idea why, and hard to debug without more context (command line, hardware, …)

Never mind, you shared those at the beginning.

I can't find any doc for the Nvidia Quadro P690; the only one I found is the P620, and that indeed seems old.

My best guess is that you have dangling python process still somehow attached to the GPU. Try and pgrep / pkill them (nvidia-smi might not show them, happened to me a few days ago after a brutal ctrl+c)
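
In shell terms, that check could look roughly like this; the process names are a guess, so adapt them to whatever is actually running:

    # Look for leftover Python processes that may still be holding GPU memory...
    pgrep -fa python
    # ...kill the training script if it is still hanging around...
    pkill -f DeepSpeech.py
    # ...and re-check the GPU afterwards.
    nvidia-smi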

So I have reworked my Deepspeech_train.sh file as of today to alleviate the issues I discussed earlier. Yesterday it was able to train for multiple epochs (before I had to terminate it). My new .sh script is below, but I really didn't change anything major that would impact data usage, so I am curious as to why I am OOMing now. I'm keeping the batch sizes at 1. I also cleared the checkpoints folders so it starts fresh.

I imagine it is hardware related. The GPU is at 0-1% utilization in the Windows task manager (I am training through WSL and Docker with my GPUs enabled). When starting DeepSpeech it jumps up to about 10-11% utilization but is also close to maxing out the dedicated VRAM (it's a tiny card with only 2 GB of VRAM). I am still perplexed as to why it worked well yesterday but today it's a no-go. I attached an image of what the GPU is doing while nothing is running.

#!/bin/sh

set -xe
if [ ! -f DeepSpeech.py ]; then
    echo "Please make sure you run this from DeepSpeech's top level directory."
    exit 1
fi;

export NVIDIA_VISIBLE_DEVICES=0
export CUDA_VISIBLE_DEVICES=0
export TF_FORCE_GPU_ALLOW_GROWTH=true

python -u DeepSpeech.py \
    --train_files //tmp/external/google_cmds_csvs/train.csv \
    --test_files //tmp/external/google_cmds_csvs/test.csv \
    --dev_files //tmp/external/google_cmds_csvs/dev.csv \
    --epochs 1 \
    --train_batch_size 1 \
    --dev_batch_size 1 \
    --test_batch_size 1 \
    --export_dir //tmp/external/deepspeech_models/googlecommands/ \
    --checkpoint_dir //tmp/external/model_chkpts/ \
    "$@"

--alphabet_config_path //DeepSpeech/data/alphabet.txt \

--scorer_path //tmp/external/deepspeech-0.7.1-models.scorer # \

--n_hidden 100 \

[image: screenshot of GPU utilization while nothing is running]

Have you checked what I said? If it worked just a few minutes ago, that's likely the reason.

Please take note this is not something we support.

Again, you brutally CTRL+C’d, please pgrep for hanging processes. Also, GPU resources handling in your WSL context might be different.


GPU usage will improve with batch size. Reported memory usage might not reflect actual consumption, because TensorFlow allocates everything by default.

Not sure this is still working, but we do have --use_allow_growth true that you can try.
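
For reference, a hedged sketch of the two ways of requesting on-demand GPU memory allocation mentioned in this thread; the CSV path is a placeholder:

    # Environment variable, already set in the script above:
    export TF_FORCE_GPU_ALLOW_GROWTH=true
    # Training flag mentioned here (plus whatever other flags you normally pass):
    python -u DeepSpeech.py --use_allow_growth true \
        --train_files /data/train.csv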

So, thanks to your advice, I just shut down my container and restarted it. Now it is training. There was probably something lingering in the container that was eating up my GPU.


There is: you need to set the epochs to 0 or 1, and/or ensure you pass --load so that the proper checkpoint is loaded.

So I used --load_checkpoint_dir on that checkpoints folder and set --epochs 0. It is now running the checkpoint on my test data via a "test epoch". When this finishes testing, will I have a completed model in my --export_dir for inference?

Aside: Thanks again for all the help @lissyx . I know you have to answer lots of silly questions on this forum but I cannot thank you enough for the quick help you have given at solving many of my issues when getting DeepSpeech running. Your assistance is extremely helpful to me.


Yes. If you hadn't passed --test_files, it would have skipped the test phase. :)

Great. It worked. It exported. Looks like it is giving blank inferences for my files in the test set though.

This is a peculiar dataset since the utterances are only 1 second long; I am curious whether using a scorer is beneficial for this type of data or whether it might be better without one.
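
One way to check would be to run the exported model on the same clip with and without the scorer using the deepspeech client; a hedged sketch, with placeholder file names:

    # With the external scorer:
    deepspeech --model output_graph.pbmm \
        --scorer deepspeech-0.7.1-models.scorer \
        --audio some_command.wav
    # Without a scorer, for comparison:
    deepspeech --model output_graph.pbmm --audio some_command.wav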

Is having blank inferences usually an issue due to not training enough? Obviously 2 epochs of training won’t be enough for a good model but I’d hope to at least get some character predictions. I just want to make sure it isn’t something else that is going astray.

Yeah, with this little training it's not really surprising that you get blank inferences.


Just to mention one thing which I noticed in many of the scripts you shared here: you can’t comment out a single line in a command of the form

command \
    param1 \
    param2 \
    param3

In Bash a comment runs from the # character to the end of the line. A trailing \ escapes the newline, so the whole construct is seen by Bash as a single line. But once a # appears, the rest of that physical line, including its trailing backslash, becomes part of the comment, so the continuation stops there: the parameters that follow never reach the command and are instead parsed as a separate command of their own.
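
As an illustration (some_command and its parameters are made up, just to show the effect):

    # Attempt to disable param2 with a comment:
    some_command \
        param1 \
        # param2 \
        param3
    # What actually runs is "some_command param1": the comment swallows its own
    # trailing backslash, so the continuation ends there, and "param3" is then
    # parsed as a separate command (which usually fails).
    # Safer: delete the line entirely, or move the unwanted parameter out of the
    # continuation chain.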