Is DeepSpeech not meant for one word audio files?

It is obviously not enough data or training

It’s not something we have had issues with on a proper model, so it should work as long as you have enough training material.

We can’t help if you are not more explicit

What are you training for? Maybe you don’t need to train from scratch …

A plot would help, but that sounds like textbook overfitting. Again: dataset size, model hyper-parameters, etc. You need to make your own adjustments.

Again, please explain what is your goal.

My current goal is just to get comfortable with DeepSpeech training within the computing environment I have available. The purpose of my post was to get feedback on this data volume and on why the model might be producing single-letter inference and/or blank inference once a scorer is added.

In the future, I will likely be leveraging Transfer Learning/Fine-Tuning due to the lack of large volumes of data in the domain I am studying.

I mainly want to assess requirements and understand what is enough data and what is not; these are very vague notions in the realm of neural networks, so I want to become comfortable with what I will need in order to create a proper model.

The data I will use for my domain is currently being collected, so in the meantime I am just trying to train DeepSpeech on a reputable dataset. However, Google Commands doesn’t seem optimal for this task due to its lack of volume. I may try transfer learning with it and see how that goes.

Please, document that so we can properly help you.

It’s not vague; it depends on your application and on your requirements. It’s documented that the default setup requires thousands of hours of audio data to start producing usable results, but producing a model is not an off-the-shelf operation.

If you don’t share more information about what your goal is, it’s complicated. Maybe, again, you don’t have to collect that much data.

My computing environment currently uses an Nvidia Quadro P690 with 2GB of VRAM. In the future I will be receiving additional cards to bring this up to 8GB of VRAM total, allowing for larger batch sizes and quicker training (then I will be able to experiment with tuning hyperparameters, etc.). Currently, with just 2GB of GPU RAM, it is not feasible for me to prototype multiple models and adjust hyperparameters in a timely fashion.

The domain of data I am exploring is air traffic audio (English speaking, English accent). There isn’t a large amount of this data transcribed out there, so I am working on that with other transcription sources. I have read several papers on applying band-pass filters, gain changes, etc. to clean audio to simulate this, but it is quite experimental, and ultimately I am waiting on better hardware before I can efficiently prototype models.

You’re telling me that I need thousands of hours of audio to train from scratch? That is useful information. I plan on using transfer learning down the road and am trying to assess how many hours of audio I will need on top of the DeepSpeech model to get good results.
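
(For reference, transfer learning in DeepSpeech, as opposed to plain fine-tuning, usually means re-initialising the top layer(s) of the released checkpoint. A rough sketch with placeholder paths and values, assuming the --drop_source_layers / --load_checkpoint_dir flags are available in your DeepSpeech version:)

# Hypothetical transfer-learning sketch: start from the released English checkpoint
# and re-initialise only the last layer. Paths, epochs and batch sizes are placeholders.
python3 DeepSpeech.py \
  --train_files atc_train.csv \
  --dev_files atc_dev.csv \
  --test_files atc_test.csv \
  --load_checkpoint_dir deepspeech-0.7.4-checkpoint/ \
  --save_checkpoint_dir atc_checkpoints/ \
  --drop_source_layers 1 \
  --n_hidden 2048 \
  --epochs 3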

Please be careful: batch size is bounded by a single GPU’s RAM, as far as I recall. You might not be able to fit as much as you expect with 4x2GB.

Can you elaborate on why the generic English model with a dedicated scorer and/or fine-tuning would not work? Because I don’t see why it could not.

Yes, that would help.

Well, again, that depends on your problem. If ATC is a very narrow language, you might be able to get something very good with much less data, maybe hundreds of hours, a dedicated external scorer.

Is this part of your job assignment or is this done on your free time? I’m curious, regarding your lack of GPU, because I’m trying some things on that matter.

The quality of audio for air traffic data is incredibly poor. Lots of additional noise, sounds, muffled speaking, etc. I have tried using the DeepSpeech model out of the box on some Air traffic audio and the results are poor.

I do believe using the DeepSpeech model as a base for fine-tuning will be appropriate (after I have the data gathered, this will take a few months).

For now I am just trying to proof-of-concept DeepSpeech and understand its configuration.

Job assignment. In my free time I generally have worked with CNNs and image data so RNNs are a new venture for me.

How much?

I’m curious about your input data: is it 8kHz or 16kHz? Mono or stereo? Those would be very important if you look into fine-tuning or transfer learning.
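
(A quick way to check, assuming sox is installed; file.wav is a placeholder for one of your files:)

soxi file.wav       # prints sample rate, channels, bit depth, duration, etc.
soxi -r file.wav    # sample rate only, e.g. 8000 or 16000
soxi -c file.wav    # channel count: 1 = mono, 2 = stereo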

I’m wondering how much you can just adapt the existing framework to your problem. Given the quality, I really wonder if you had a look at:

  • some low/high-pass filtering (see the quick sox sketch below this list),
  • some de-noising (RNNoise is said to be quite good)
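
A minimal sox sketch of that kind of filtering; the file names are placeholders and the 300-3400 Hz band is only a guess at a typical radio voice band, so adjust it to your data:

# Band-pass roughly to the voice band of a radio channel.
sox input.wav filtered.wav sinc 300-3400
# Equivalent using separate high-pass / low-pass effects:
sox input.wav filtered.wav highpass 300 lowpass 3400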

I suspect you’ve run experiments already, so if you could elaborate more on those, that might help us understand how far you are from your goal and see if there’s something we can help with in the meantime.

As for inference results, I can’t really provide a large amount since the data is still being collected. In ad-hoc cases, though, the WER definitely exceeds 90% at inference with the DeepSpeech 0.7.x model and scorer.

I have scripting that converts my files to 16 kHz before I ever use them for DeepSpeech inference or training. As for stereo vs. mono, I have to assess. DeepSpeech uses mono, correct? I can convert them to mono if necessary.

I also have scripting that can apply low-pass, high-pass, and band-pass filters to my data. I have little signal-processing background, but I have found some success with various parameters here.

You can get an accurate picture with a few hours already.

It’d be interesting to check the CER as well.

You convert to 16kHz from what?

Our model is trained on 16kHz mono, if you feed anything else it will produce erratic results. Our example binaries might do automatic down/up-sampling and stereo to mono, but it might introduce glitches, so in specific cases like yours it’s always beneficial to completely control this.
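
A minimal sketch of controlling that conversion yourself, assuming sox or ffmpeg is installed and with placeholder file names:

# Convert any input to 16 kHz, mono, 16-bit PCM WAV in one step.
sox input.wav -r 16000 -c 1 -b 16 output.wav
# ffmpeg equivalent, e.g. straight from a downloaded .mp4:
ffmpeg -i input.mp4 -ar 16000 -ac 1 output.wav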

Some of the air traffic audio I have obtained comes from YouTube: I pull the audio from the .mp4, convert it to .mp3 at 44.1 kHz, then downsample to 16 kHz and export as WAV. I’ll need to convert to mono in this step as well. As for the optimal order of processing, does it matter?

Hard to tell, I don’t think it should impact, but we’d be curious to know your feedback on this processing.

Got it. I’ll go through my files and convert and see. Will probably be able to provide some WER and CER benchmarks pre and post conversion in the next few days.

This may be a bit of an aside, but in reference to the title of my topic: I benchmarked the 0.7.4 release model on the Google Commands test set I am using, which is about 6,400 one-second, single-word audio clips.

With Scorer: WER 48%, CER 32%
Without Scorer (just model use at inference time): WER 42%, CER 23%
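
(For reference, one way such a benchmark can be run, though not necessarily how it was done here: a test-only sketch that reuses the paths from the script further down, assuming the 0.7.x flag names from the docs; the scorer path is a placeholder:)

python3 -u DeepSpeech.py \
  --test_files //tmp/external/google_cmds_csvs/test.csv \
  --test_batch_size 1 \
  --checkpoint_dir //tmp/external/mozilla_release_chkpts/deepspeech-0.7.4-checkpoint/ \
  --scorer_path kenlm.scorer   # drop this flag for the no-scorer run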

Thought I would share. I know that making my own scorer for this data is challenging, since my corpus is strictly unigrams and KenLM doesn’t support unigram-order language models. I am wondering if the Mozilla scorer being built primarily on sentences is what causes the .scorer to hurt me out of the box.

Regardless, I thought it was important to share. I am going to attempt fine-tuning the 0.7.4 release on Google Commands with a smaller learning rate. I’m not sure how many epochs will be effective, but I’ll iterate through a few different combinations, although it will probably be slow while my hardware is lacking.
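
(For the smaller learning rate, DeepSpeech exposes a --learning_rate training flag; a minimal sketch, where the paths and the value are just illustrative guesses:)

python3 DeepSpeech.py --train_files train.csv --dev_files dev.csv --test_files test.csv \
  --checkpoint_dir deepspeech-0.7.4-checkpoint/ --learning_rate 0.0001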


There’s a trick you can use as a workaround: just add a single sentence with more words, maybe some not used in your benchmark.

Interesting. So add a single sentence with random words and use arpa order = # of words in that sentence? How does this get around it?

Yes. I think because KenLM can now build a #-gram model, it doesn’t raise the error, and the rest of the #-grams are 1-grams. I did use this to run a benchmark and the 1-gram results were worse than the 3-gram results but better than a non-specialized language model, so I’m assuming this approach is not bad ^^
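
A rough sketch of that workaround, with placeholder file names; the padding sentence, the alpha/beta values, and the packaging tool’s flag names (taken from the 0.7.x external-scorer docs) are assumptions:

# vocab.txt holds one command word per line; append one multi-word padding sentence
# so lmplz will accept an order > 1 (the words themselves are arbitrary).
echo "zebra quokka axolotl" >> vocab.txt
lmplz --order 3 --discount_fallback < vocab.txt > lm.arpa
build_binary lm.arpa lm.binary
# Package it into a .scorer with the tool shipped for your DeepSpeech version:
./generate_scorer_package --alphabet alphabet.txt --lm lm.binary --vocab vocab.txt \
  --package commands.scorer --default_alpha 0.93 --default_beta 1.18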


Thanks for the hack. I will try it myself. :grinning:

@lissyx Working on fine-tuning and running into an error. I suspect OOM, but I’m not sure.

CUDA and cuDNN info:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
root@d7284da3dc5c:/DeepSpeech# cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5

fine-tuning bash script:

#!/bin/sh

set -xe
if [ ! -f DeepSpeech.py ]; then
    echo "Please make sure you run this from DeepSpeech's top level directory."
    exit 1
fi;


export NVIDIA_VISIBLE_DEVICES=0
export CUDA_VISIBLE_DEVICES=0
#export TF_FORCE_GPU_ALLOW_GROWTH=true

python3 -u DeepSpeech.py \
  --train_files //tmp/external/google_cmds_csvs/train.csv \
  --test_files  //tmp/external/google_cmds_csvs/test.csv \
  --dev_files  //tmp/external/google_cmds_csvs/dev.csv \
  --epochs 5 \
  --train_batch_size 1 \
  --dev_batch_size 1 \
  --test_batch_size 1 \
  --export_dir //tmp/external/deepspeech_models/deepspeech_fine_tuned_models/googlecommands/ \
  --use_allow_growth  \
  --n_hidden 2048 \
  --train_cudnn  \
  --checkpoint_dir //tmp/external/mozilla_release_chkpts/deepspeech-0.7.4-checkpoint/  \
  "$@"

I imagine the error below means my GPU is running out of memory? It probably can’t handle n_hidden 2048, even with batch sizes equal to 1…
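
(One quick way to check whether memory is really the culprit is to watch GPU usage in a second terminal while training runs, e.g.:)

watch -n 1 nvidia-smi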

I Loading best validating checkpoint from //tmp/external/mozilla_release_chkpts/deepspeech-0.7.4-checkpoint/best_dev-732522
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Initializing variable: learning_rate
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 0 |   Training | Elapsed Time: 0:00:02 | Steps: 1 | Loss: 17.931751
Epoch 0 |   Training | Elapsed Time: 0:00:02 | Steps: 2 | Loss: 25.073557
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 3 | Loss: 22.607900
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 4 | Loss: 19.332927
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 5 | Loss: 18.299621
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 6 | Loss: 22.737631
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 7 | Loss: 21.567622
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 8 | Loss: 23.387867
2020-07-30 15:45:48.370974: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
2020-07-30 15:45:48.371022: F ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:830] Non-OK-status: GpuLaunchKernel(ColumnReduceKernel<IN_T, OUT_T, Op>, grid_dim, block_dim, 0, cu_stream, in, out, extent_x, extent_y, op, init) status: Internal: unknown error
Fatal Python error: Aborted

Thread 0x00007f5199679700 (most recent call first):
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 379 in _recv
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407 in _recv_bytes
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 250 in recv
  File "/usr/lib/pyAborted