Is DeepSpeech not meant for one-word audio files?

Job assignment. In my free time I have generally worked with CNNs and image data, so RNNs are a new venture for me.

How much?

I’m curious about your input data: is it 8 kHz or 16 kHz? Mono or stereo? Those details are very important if you look into fine-tuning or transfer learning.

I’m wondering how much you can adapt the existing framework to your problem. Given the audio quality, I really wonder whether you have had a look at:

  • some low/high pass filtering,
  • some de-noising (RNNoise is said to be quite good)

I suspect you’ve run experiments already, so if you could elaborate on those, it would help us understand how far you are from your goal and see whether there’s something we can help with in the meantime.

As for inference results, I can’t really provide a large sample since my dataset is still being collected. In ad hoc cases, though, the WER definitely exceeds 90% at inference with the DeepSpeech 0.7.x model and scorer.

I have scripting that converts my files to 16 kHz before I ever use them for DeepSpeech inference or training. As for stereo vs. mono, I still have to check. DeepSpeech uses mono, correct? I can convert the files to mono if necessary.

I also have scripting that can apply low-pass, high-pass, and band-pass filters to my data. I have little signal-processing background, but I have found some success with various parameter settings here.
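
For reference, the filtering is essentially a sox effect chain along these lines (a sketch; sox is just one option, and the cutoff frequencies are example values I have been experimenting with, not recommendations):

# High-pass at 300 Hz and low-pass at 3.4 kHz, roughly a telephony band.
# A band-pass effect, or an RNNoise de-noising pass, would be separate steps.
sox input_16k.wav filtered.wav highpass 300 lowpass 3400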

You can get an accurate picture with just a few hours of data.

It’d be interesting to check the CER as well.

You convert to 16kHz from what?

Our model is trained on 16 kHz mono; if you feed it anything else, it will produce erratic results. Our example binaries might do automatic down/up-sampling and stereo-to-mono conversion, but that can introduce glitches, so in specific cases like yours it’s always beneficial to control this completely.
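
For example, a single sox call makes the expected format explicit (a sketch; the file names are placeholders and ffmpeg works just as well):

# Resample to 16 kHz, 16-bit, mono WAV in one step.
sox input.wav -r 16000 -b 16 -c 1 output_16k_mono.wav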

Some of the air traffic audio I have obtained comes from YouTube: I pull the audio as .mp4, convert it to .mp3 at 44.1 kHz, then downsample to 16 kHz and export as WAV. I’ll need to convert to mono in this step as well. As for the optimal order of processing, does it matter?

Hard to tell. I don’t think it should have an impact, but we’d be curious to hear your feedback on this processing.

Got it. I’ll go through my files, convert them, and see. I should be able to provide some WER and CER benchmarks pre- and post-conversion in the next few days.
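
The plan is to get those numbers from the training code’s test phase, which reports both WER and CER, roughly like this (a sketch; the CSV and scorer paths are placeholders, and flag names should be double-checked against python3 DeepSpeech.py --helpfull):

python3 -u DeepSpeech.py \
  --checkpoint_dir //tmp/external/mozilla_release_chkpts/deepspeech-0.7.4-checkpoint/ \
  --test_files //tmp/external/atc_csvs/test.csv \
  --test_batch_size 1 \
  --scorer_path //tmp/external/deepspeech-0.7.4-models.scorer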

This may be a bit of an aside, but in reference to the title of my topic: I benchmarked the 0.7.4 release model on the Google Commands test set I am using, which is about 6,400 one-second, one-word audio clips.

With Scorer: WER 48%, CER 32%
Without Scorer (just model use at inference time): WER 42%, CER 23%

Thought I would share. I know that making my own scorer for this data is challenging, since my corpus is strictly unigrams and KenLM doesn’t support unigram-order language models. I am wondering whether the Mozilla scorer being built primarily on sentences is what causes the .scorer to hurt me out of the box.

Regardless, I thought it was important to share. I am going to attempt fine-tuning the 0.7.4 release on Google Commands with a smaller learning rate. I’m not sure how many epochs will be effective, but I’ll iterate through a few different combinations, although it will probably be slow since my hardware is currently lacking.

There’s a trick you can use as a workaround: just add a single sentence with more words, maybe some not used in your benchmark.

Interesting. So add a single sentence with random words and use an ARPA order equal to the number of words in that sentence? How does this get around it?

Yes. I think that because KenLM can now build an n-gram of that order, it doesn’t raise the error, and the rest of the n-grams are 1-grams. I used this to run a benchmark, and the 1-gram results were worse than the 3-gram results but better than a non-specialized language model, so I’m assuming this approach is not bad ^^
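
Roughly, the build then looks like this (a sketch; the padding sentence, file names, and alpha/beta values are placeholders, and the exact packaging script and flags depend on your release, so check data/lm in the DeepSpeech repo):

# One command word per line, plus a single longer "padding" sentence made of
# words not in the benchmark; it needs at least as many words as --order.
cp google_commands_words.txt lm_corpus.txt
echo "this padding sentence only exists so the arpa build succeeds" >> lm_corpus.txt

# --discount_fallback is usually needed for such a tiny corpus.
lmplz --order 3 --discount_fallback --text lm_corpus.txt --arpa lm.arpa
build_binary -a 255 -q 8 trie lm.arpa lm.binary

# Package the binary LM and vocabulary into a .scorer (0.7.x ships a
# data/lm/generate_package.py style script for this; check your release).
python3 data/lm/generate_package.py \
  --alphabet data/alphabet.txt \
  --lm lm.binary \
  --vocab lm_corpus.txt \
  --package commands.scorer \
  --default_alpha 0.93 --default_beta 1.18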

Thanks for the hack. I will try it myself. :grinning:

@lissyx I’m working on fine-tuning and running into an error. I suspect OOM, but I’m not sure.

Cuda and cuDNN info:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
root@d7284da3dc5c:/DeepSpeech# cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5

fine-tuning bash script:

#!/bin/sh

set -xe
if [ ! -f DeepSpeech.py ]; then
    echo "Please make sure you run this from DeepSpeech's top level directory."
    exit 1
fi;


export NVIDIA_VISIBLE_DEVICES=0
export CUDA_VISIBLE_DEVICES=0
#export TF_FORCE_GPU_ALLOW_GROWTH=true

python3 -u DeepSpeech.py \
  --train_files //tmp/external/google_cmds_csvs/train.csv \
  --test_files  //tmp/external/google_cmds_csvs/test.csv \
  --dev_files  //tmp/external/google_cmds_csvs/dev.csv \
  --epochs 5 \
  --train_batch_size 1 \
  --dev_batch_size 1 \
  --test_batch_size 1 \
  --export_dir //tmp/external/deepspeech_models/deepspeech_fine_tuned_models/googlecommands/ \
  --use_allow_growth  \
  --n_hidden 2048 \
  --train_cudnn  \
  --checkpoint_dir //tmp/external/mozilla_release_chkpts/deepspeech-0.7.4-checkpoint/  \
  "$@"

I imagine this error means my GPU is running out of memory? It probably can’t handle n_hidden 2048, even with batch sizes of 1…

I Loading best validating checkpoint from //tmp/external/mozilla_release_chkpts/deepspeech-0.7.4-checkpoint/best_dev-732522
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Initializing variable: learning_rate
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 0 |   Training | Elapsed Time: 0:00:02 | Steps: 1 | Loss: 17.931751
Epoch 0 |   Training | Elapsed Time: 0:00:02 | Steps: 2 | Loss: 25.073557
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 3 | Loss: 22.607900
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 4 | Loss: 19.332927
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 5 | Loss: 18.299621
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 6 | Loss: 22.737631
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 7 | Loss: 21.567622
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 8 | Loss: 23.387867
2020-07-30 15:45:48.370974: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
2020-07-30 15:45:48.371022: F ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:830] Non-OK-status: GpuLaunchKernel(ColumnReduceKernel<IN_T, OUT_T, Op>, grid_dim, block_dim, 0, cu_stream, in, out, extent_x, extent_y, op, init) status: Internal: unknown error
Fatal Python error: Aborted

Thread 0x00007f5199679700 (most recent call first):
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 379 in _recv
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407 in _recv_bytes
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 250 in recv
  File "/usr/lib/pyAborted

Sorry, that’s the first time I’ve seen that error…

I tried adding in my own --learning_rate 0.0001.

I think I have seen this error before in other Discourse topics, but I’m not sure what it means.

Error:

+ [ ! -f DeepSpeech.py ]
+ export NVIDIA_VISIBLE_DEVICES=0
+ export CUDA_VISIBLE_DEVICES=0
+ python3 -u DeepSpeech.py --train_files //tmp/external/google_cmds_csvs/train.csv --test_files //tmp/external/google_cmds_csvs/test.csv --dev_files //tmp/external/google_cmds_csvs/dev.csv --epochs 5 --train_batch_size 1 --dev_batch_size 1 --test_batch_size 1 --export_dir //tmp/external/deepspeech_models/deepspeech_fine_tuned_models/googlecommands/ --use_allow_growth --n_hidden 2048 --train_cudnn --learning_rate 0.0001 --checkpoint_dir //tmp/external/mozilla_release_chkpts/deepspeech-0.7.4-checkpoint/
I Loading best validating checkpoint from //tmp/external/mozilla_release_chkpts/deepspeech-0.7.4-checkpoint/best_dev-732522
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Initializing variable: learning_rate
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 0 |   Training | Elapsed Time: 0:00:02 | Steps: 1 | Loss: 17.931751
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 2 | Loss: 23.942448
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 20, 1, 2048]
         [[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
  (1) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 20, 1, 2048]
         [[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
         [[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_81]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 955, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 927, in main
    train()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 595, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 560, in run_set
    feed_dict=feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 20, 1, 2048]
         [[node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 20, 1, 2048]
         [[node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
         [[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_81]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3':
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 955, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 927, in main
    train()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 473, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 321, in get_tower_results
    gradients = optimizer.compute_gradients(avg_loss)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/optimizer.py", line 512, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_impl.py", line 158, in gradients
    unconnected_gradients)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py", line 679, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py", line 350, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py", line 679, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/cudnn_rnn_grad.py", line 104, in _cudnn_rnn_backwardv3
    direction=op.get_attr("direction")) + (None,)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 749, in cudnn_rnn_backprop_v3
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

...which was originally created as op 'tower_0/cudnn_lstm/CudnnRNNV3', defined at:
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
[elided 4 identical lines from previous traceback]
  File "/DeepSpeech/training/deepspeech_training/train.py", line 473, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 312, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 239, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 190, in create_model
    output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 128, in rnn_impl_cudnn_rnn
    sequence_lengths=seq_length)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
    return converted_call(f, options, args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 440, in call
    training)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 518, in _forward
    seed=self._seed)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1132, in _cudnn_rnn
    outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 2051, in cudnn_rnnv3
    time_major=time_major, name=name)

So, you might be hitting https://github.com/tensorflow/tensorflow/issues/41630
Please try with the TF_CUDNN_RESET_RND_GEN_STATE=1 env var.
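
For example, exported in the fine-tuning script right before the python3 call:

# Workaround suggested in the TensorFlow issue above.
export TF_CUDNN_RESET_RND_GEN_STATE=1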

It can also be just a symptom of GPU OOM.

Yeah, I retried with that env var and still got the same error; it stops at about the 7th step with a batch size of 1. Is there a large amount of memory required to run cuDNN with n_hidden 2048?

If so, I may have to work on getting more GPU power before I can successfully fine-tune.

Try reducing the network size; you’ll know quite fast.

Yep. I tried to train with --n_hidden 100 and --train_cudnn, and it was able to train for a while. Maybe my GPU just can’t handle cuDNN; my GPU usage skyrockets whenever I use that flag.