Is DeepSpeech not meant for one-word audio files?

Hello. I have been running DeepSpeech over a few configurations with the Google Commands dataset. There are 65,000 one-second WAV files at a 16,000 Hz sample rate, which equates to about 18 hours of audio. I have run the model for short periods of time (2 epochs) on 0.7.4, and at inference my model is able to predict some letters (e’s and o’s) without a scorer (with a scorer it predicts blanks).
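
For context, the training invocation looks roughly like the following; this is a sketch with placeholder paths and batch sizes, using the documented 0.7.x flags rather than my exact command:

```python
# Sketch of a from-scratch training run on the Google Commands CSVs.
# Paths and batch sizes are placeholders; flag names are the documented
# 0.7.x ones, so double-check them against your DeepSpeech checkout.
import subprocess

subprocess.run(
    [
        "python3", "DeepSpeech.py",
        "--train_files", "data/google_commands/train.csv",
        "--dev_files", "data/google_commands/dev.csv",
        "--test_files", "data/google_commands/test.csv",
        "--epochs", "2",
        "--train_batch_size", "8",   # small, to fit in 2 GB of VRAM
        "--dev_batch_size", "8",
        "--test_batch_size", "8",
        "--learning_rate", "0.0001",
        "--checkpoint_dir", "checkpoints/google_commands",
        "--export_dir", "exported_model",
    ],
    check=True,
)
```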

I have tried running the same data on 0.7.3 in the past with 10 epochs of training and received blank inference with and without a scorer.

I am curious about a few things:

  1. Is this simply not enough hours of audio for good results?

  2. Are one-second, one-word utterances suboptimal for the RNN architecture? Even using the out-of-the-box DeepSpeech model and scorer, the WER I received on a random sample of 1,500 Google Commands files was ~47%. Interestingly enough, results seemed to be better without the scorer (I imagine this has to do with the probabilities learned from the sentence-based DeepSpeech corpus; see the inference sketch after this list). KenLM doesn’t support building an order-1 (unigram) language model, so it seems that using a scorer on one-word utterances is not ideal.

  3. I know these epoch counts are low, but hardware is limited at the moment. I mainly just want to prove that I can train DeepSpeech from scratch and receive tangible results (good or bad, but not blank inference) and make sure it isn’t my configuration or setup that is the issue.
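
For point 2, the with/without-scorer comparison can be run along these lines; a minimal sketch using the deepspeech Python package, with placeholder model, scorer, and WAV paths rather than my exact benchmark script:

```python
# Minimal sketch: run one 16 kHz mono WAV through the released model,
# with and without the external scorer. All paths are placeholders.
import wave
import numpy as np
from deepspeech import Model

model = Model("deepspeech-0.7.4-models.pbmm")

with wave.open("google_commands/right/sample_clip.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print("no scorer:", model.stt(audio))

model.enableExternalScorer("deepspeech-0.7.4-models.scorer")
print("with scorer:", model.stt(audio))
```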

Aside: in both of my training trials (2 epochs and 10 epochs) the training loss gets down to ~7.25 while the test/dev loss sticks around 30. I have yet to start experimenting with dropout and other hyperparameters due to the limited hardware at the moment.

I have faith in the DeepSpeech architecture and want to ensure that I am utilizing and understanding this software inside and out before I fully assess performance.

Thank you

That is obviously not enough data, nor enough training.

It’s not something we have had issues with on a proper model, so it should work as long as you have enough training material.

We can’t help if you are not more explicit

What are you training for? Maybe you don’t need to train from scratch …

Plot, please, but that sounds like textbook overfitting. Again: dataset size, model hyperparameters, etc. You need to make your own adjustments.
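
To be clear, the plot asked for here is just train vs. dev loss per epoch; a minimal matplotlib sketch, with placeholder loss values to be filled in from your own training log:

```python
# Minimal sketch of the train-vs-dev loss plot being asked for.
# The loss values below are placeholders; fill them in from the training log.
import matplotlib.pyplot as plt

train_loss = [120.0, 60.0, 30.0, 15.0, 7.25]   # placeholder per-epoch values
dev_loss = [110.0, 70.0, 45.0, 35.0, 30.0]     # placeholder per-epoch values
epochs = range(1, len(train_loss) + 1)

plt.plot(epochs, train_loss, label="train loss")
plt.plot(epochs, dev_loss, label="dev loss")
plt.xlabel("epoch")
plt.ylabel("CTC loss")
plt.legend()
plt.show()
```

A widening gap between the two curves is the textbook overfitting signature mentioned above.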

Again, please explain what your goal is.

My current goal is just to get comfortable with DeepSpeech training within the computing environment I have available. The purpose of my post was just to get feedback on this data volume and on why it might be giving single-letter inference and/or blank inference with the addition of a scorer.

In the future, I will likely be leveraging Transfer Learning/Fine-Tuning due to the lack of large volumes of data in the domain I am studying.

I mainly want to assess requirements and understand what is enough data and what is not; these requirements are all very vague in the realm of neural networks, so I want to become comfortable with what I will need in order to create a proper model.

The data I will use for my domain is currently being collected, so in the meantime I am just trying to train DeepSpeech with a reputable dataset. Google Commands doesn’t seem optimal for this task, though, due to its lack of volume. I may try transfer learning with it and see how that goes.

Please, document that so we can properly help you.

It’s not vague; it depends on your application and on your requirements. It’s documented that the default setup requires thousands of hours of audio data to start producing properly usable results, but producing a model is not an off-the-shelf operation.

If you don’t share more information on what your goal is, it’s complicated. Maybe, again, you don’t have to collect that much data.

My computing environment currently uses an Nvidia Quadro P690 with 2GB of VRAM. In the future I will be receiving some additional cards to bring this up to 8GB of VRAM total, allowing for larger batch sizes and quicker training (then I will be able to experiment with tuning hyperparameters, etc.). Currently, with just 2GB of GPU RAM, it is not feasible for me to prototype multiple models and adjust hyperparameters in a timely fashion.

The domain of data I am exploring is air traffic audio (English-speaking, English accents). There isn’t a large amount of this data transcribed out there, so I am working on that with other transcription sources. I have read several papers on augmenting clean audio with band-pass filters, gain changes, etc. to simulate this, but it is quite experimental, and ultimately I am waiting on better hardware before I can efficiently prototype models.

You’re telling me that I need thousands of hours of audio to train from scratch? That is useful information. I plan on using transfer learning down the road and am trying to assess how many hours of audio I will need on top of the DeepSpeech model to get good results.

Please be careful: batch size is dependent on a single GPU’s RAM, as far as I recall. You might not be able to fit as much as you expect with 4x2GB.

Can you elaborate on why the generic English model with a dedicated scorer and/or fine-tuning would not work? Because I don’t see why it could not.

Yes, that would help.

Well, again, that depends on your problem. If ATC is a very narrow language, you might be able to get something very good with much less data, maybe hundreds of hours and a dedicated external scorer.

Is this part of your job assignment or is it done in your free time? I’m curious, regarding your lack of GPU, because I’m trying some things on that matter.

The quality of air traffic audio is incredibly poor: lots of additional noise, sounds, muffled speaking, etc. I have tried using the DeepSpeech model out of the box on some air traffic audio and the results are poor.

I do believe using the DeepSpeech model as a base for fine-tuning will be appropriate (after I have the data gathered, which will take a few months).

For now I am just trying to build a proof of concept with DeepSpeech and understand its configuration.

Job assignment. In my free time I generally have worked with CNNs and image data so RNNs are a new venture for me.

How much?

I’m curious about your input data: is it 8kHz or 16kHz? Mono or stereo? Those details would be very important if you look into fine-tuning or transfer learning.

I’m wondering how much you can just adapt the existing framework to your problem. Given the quality, I really wonder if you have had a look at:

  • some low/high-pass filtering (a rough band-pass sketch follows below),
  • some de-noising (RNNoise was said to be quite good)

I suspect you’ve run experiments already, so if you could elaborate more on those, it might help us understand how far you are from your goal and see if there’s something we can help with in the meantime.
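
To make the filtering point concrete, here is a rough band-pass sketch with scipy; the 300-3400 Hz band and the file paths are placeholders to experiment with, not values tuned for ATC audio:

```python
# Rough sketch of band-pass filtering a 16 kHz mono WAV with scipy.
# The pass band (300-3400 Hz) and the paths are placeholders to experiment with.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

rate, audio = wavfile.read("atc_clip_16k_mono.wav")   # assumes mono 16-bit PCM
sos = butter(4, [300, 3400], btype="bandpass", fs=rate, output="sos")
filtered = sosfiltfilt(sos, audio.astype(np.float64))

wavfile.write("atc_clip_bandpassed.wav", rate, filtered.astype(np.int16))
```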

As for inference results, I can’t really provide a large amount since the data is still being collected. In ad-hoc cases, though, the WER definitely exceeds 90% at inference with the DeepSpeech 0.7.x model and scorer.

I have scripting that converts my files to 16 kHz before I ever use them for DeepSpeech inference or training. As for stereo vs. mono, I will have to check. DeepSpeech uses mono, correct? I can convert them to mono if necessary.

I also have scripting that can apply low-, high-, and band-pass filters to my data. I have little signal-processing background, but I have found some success with various parameters here.

You can get an accurate picture with a few hours already.

It’d be interesting to check the CER as well.
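
For quick ad-hoc checks outside of DeepSpeech’s own test report, a small self-contained WER/CER sketch (the reference and hypothesis strings are just placeholders):

```python
# Self-contained sketch of WER/CER via Levenshtein distance.
# Not DeepSpeech's evaluation code; just for quick ad-hoc checks.
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over two sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

print(wer("turn left heading two seven zero", "turn left heading to seven zero"))
print(cer("turn left heading two seven zero", "turn left heading to seven zero"))
```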

You convert to 16kHz from what?

Our model is trained on 16kHz mono; if you feed it anything else it will produce erratic results. Our example binaries might do automatic down-/up-sampling and stereo-to-mono conversion, but that might introduce glitches, so in specific cases like yours it’s always beneficial to completely control this.
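
For example, a minimal sketch that converts a source file straight to 16kHz mono 16-bit WAV in one step, assuming ffmpeg is available and with placeholder paths:

```python
# Sketch: convert an arbitrary source file straight to 16 kHz mono 16-bit WAV.
# Assumes ffmpeg is installed and on PATH; paths are placeholders.
import subprocess

def to_16k_mono_wav(src, dst):
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-ac", "1",               # downmix to mono
         "-ar", "16000",           # resample to 16 kHz
         "-c:a", "pcm_s16le",      # 16-bit PCM WAV
         dst],
        check=True,
    )

to_16k_mono_wav("atc_clip_from_youtube.mp4", "atc_clip_16k_mono.wav")
```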

Some of the air traffic audio I have obtained comes from YouTube: I pull the audio as .mp4, convert it to .mp3 at 44.1 kHz, and then downsample to 16 kHz and export as WAV. I’ll need to convert to mono in this step as well. As for the optimal order of processing, does it matter?

Hard to tell; I don’t think it should have an impact, but we’d be curious to hear your feedback on this processing.

Got it. I’ll go through my files, convert them, and see. I will probably be able to provide some WER and CER benchmarks pre- and post-conversion in the next few days.

This may be a bit of an aside, but in reference to the title of my topic: I benchmarked the 0.7.4 release model on the Google Commands test set I am using, which is about 6,400 one-second, one-word audio clips.

With Scorer: WER 48%, CER 32%
Without Scorer (just model use at inference time): WER 42%, CER 23%

Thought I would share. I know that making my own scorer for this data is challenging since my corpus is strictly unigrams and KenLM doesn’t support order-1 language models. I am wondering whether the Mozilla scorer being built primarily on full sentences is what causes the .scorer to hurt me out of the box.

Regardless, I thought it was important to share. I am going to attempt fine-tuning the 0.7.4 release on Google Commands with a smaller learning rate. I’m not sure how many epochs I will need for it to be effective, but I’ll iterate through a few different combinations, although it will probably be slow while my hardware is lacking.

There’s a trick you can use as a workaround: just add a single sentence with more words, maybe some not used in your benchmark.

Interesting. So add a single sentence with random words and use an ARPA order equal to the number of words in that sentence? How does this get around it?

Yes. I think that because KenLM can then build an n-gram of that order, it doesn’t raise the error, and the rest of the n-grams are 1-grams. I did use this to run a benchmark, and the 1-gram results were worse than the 3-gram results but better than a non-specialized language model, so I’m assuming this approach is not bad ^^
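
A sketch of what that can look like when building the ARPA file; the word list, padding sentence, and lmplz options are placeholders, and the usual 0.7 scorer-packaging step still follows afterwards:

```python
# Sketch of the workaround: a vocabulary of single words plus one padding
# sentence so KenLM's lmplz accepts a higher-order model. Assumes the kenlm
# binaries are built and on PATH; the words and paths are placeholders.
import subprocess

words = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]
padding_sentence = "the quick fox"  # three words, not in the benchmark -> order 3

with open("commands_corpus.txt", "w") as f:
    f.write("\n".join(words) + "\n")
    f.write(padding_sentence + "\n")

with open("commands_corpus.txt", "rb") as stdin, open("commands.arpa", "wb") as stdout:
    subprocess.run(
        ["lmplz", "-o", "3", "--discount_fallback"],  # fallback helps on tiny corpora
        stdin=stdin, stdout=stdout, check=True,
    )
```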

Thanks for the hack. I will try it myself. :grinning: