Overfitting DeepSpeech model on small amount of data

So here’s my problem: I’m trying to create a personal assistant for healthcare providers. The assistant only needs to recognize a specific set of commands (23 in total). Since these sentences are few in number, I want to overfit the DeepSpeech model on them, so that I can get high accuracy from a small amount of data.

To do that, I have 6 samples of each command, each from a different speaker. My validation and test data are the training data itself (overfitting, right? :stuck_out_tongue: ). However, after continuing training from a pretrained model for 3 epochs, the model just predicts the letter "h".

The following is the log after training the model:

Computing acoustic model predictions...
100% (46 of 46) |######################################################################################################| Elapsed Time: 0:01:39 Time:  0:01:39
Decoding predictions...
100% (46 of 46) |######################################################################################################| Elapsed Time: 0:01:16 Time:  0:01:16
Test - WER: 10.146552, CER: 3.705833, loss: 196.455765
--------------------------------------------------------------------------------
WER: 24.500000, CER: 96.000000, loss: 166.814606
 - src: "next patient"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h"
--------------------------------------------------------------------------------
WER: 22.000000, CER: 86.000000, loss: 123.011139
 - src: "whos next"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h "
--------------------------------------------------------------------------------
WER: 21.500000, CER: 85.000000, loss: 130.863281
 - src: "next patient"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h "
--------------------------------------------------------------------------------
WER: 20.500000, CER: 79.000000, loss: 116.693535
 - src: "whos next"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h"
--------------------------------------------------------------------------------
WER: 20.333333, CER: 120.000000, loss: 190.458618
 - src: "my first patient"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h "
--------------------------------------------------------------------------------
WER: 17.000000, CER: 98.000000, loss: 200.597824
 - src: "how many appointments"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h"
--------------------------------------------------------------------------------
WER: 15.500000, CER: 119.000000, loss: 207.080734
 - src: "whos my first patient"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h"
--------------------------------------------------------------------------------
WER: 14.666667, CER: 85.000000, loss: 176.761948
 - src: "my first patient"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h"
--------------------------------------------------------------------------------
WER: 14.000000, CER: 81.000000, loss: 129.585342
 - src: "who is next"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h "
--------------------------------------------------------------------------------
WER: 14.000000, CER: 81.000000, loss: 155.495468
 - src: "who is next"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h "

I’d appreciate any help in identifying the problem, whether it’s in the training steps or in the data quantity/quality. I’m also open to any alternative approaches to this specific requirement. Thanks.

Could you please start by making your console output properly formatted as code? It’s hard to distinguish what is output and what is your comment / question.

I’ve updated the console output in the post to the appropriate format. Please check now.


I guess it’d be very important you also share the command line you use.

Here’s the command I’m running. I’ve downloaded the compressed checkpoints file and extracted its contents into the "fine_tuning_checkpoints" directory.

python3 DeepSpeech.py --n_hidden 2048 --fine_tuning_checkpoints/ --epoch -1 --train_files /home/furqan/Projects/assistant/audio_data/downsampled/training.csv --dev_files /home/furqan/Projects/assistant/audio_data/downsampled/training.csv --test_files /home/furqan/Projects/assistant/audio_data/downsampled/training.csv --learning_rate 0.0001

I’m running the code on an Ubuntu machine with 125 GB of RAM.

/furqan/Projects$ uname -a
Linux DS0211 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Here you state that you continued training for three epochs, but the command line you shared asks for one extra epoch; that’s inconsistent.

Having experimented with that, I got very good results by just building a language model made of those commands. Maybe you should give that a try?

Why would you absolutely require re-training? Do those commands use domain-specific language that is unlikely to be properly recognized by default?

Looks like you forgot to specify the --checkpoint_dir flag, so it’s starting a new training from scratch rather than fine-tuning the release model.

Thanks @reuben. It turns out I was specifying the checkpoint directory the wrong way: I was passing the directory without the --checkpoint_dir flag in front of it.
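For anyone else hitting this, the corrected invocation should look roughly like the following: the same flags as before, just with the checkpoint directory passed through --checkpoint_dir (the paths are specific to my setup).

# fine-tune from the released checkpoints instead of starting a fresh training
python3 DeepSpeech.py --n_hidden 2048 \
  --checkpoint_dir fine_tuning_checkpoints/ \
  --epoch -1 \
  --train_files /home/furqan/Projects/assistant/audio_data/downsampled/training.csv \
  --dev_files /home/furqan/Projects/assistant/audio_data/downsampled/training.csv \
  --test_files /home/furqan/Projects/assistant/audio_data/downsampled/training.csv \
  --learning_rate 0.0001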

@lissyx, training from scratch on the data I have, or changing the language model, is not desirable, since the language is still English and the predictions are also part of the English language (if that makes sense).

I can report that the issue has been resolved: after training the model for 5 epochs, it’s now giving near-perfect results (still some kinks that need to be ironed out).

I’m not sure what you’re saying here, but I can assure you that just building a specific language model and using it with the English acoustic model works very well. So I don’t get your point.

As an alternative suggestion, I wouldn’t take the route you are taking, i.e. fine-tuning the DeepSpeech acoustic model.

What I would do is simply use the existing acoustic model and create a new language model using only the 23 sentences you expect to hear.

@lissyx has done just this for a demo we made and the results are very good. This direction has the advantage of being much simpler to execute.
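A minimal sketch of that approach with KenLM, assuming lmplz and build_binary are installed and that commands.txt holds one allowed command per line (the final trie/scorer packaging step depends on your DeepSpeech version, so follow the docs for the release you use):

# commands.txt: one allowed sentence per line, e.g. "next patient"
# build a small n-gram language model restricted to those sentences;
# --discount_fallback is needed because the corpus is tiny
lmplz --order 3 --discount_fallback --text commands.txt --arpa commands.arpa
build_binary commands.arpa commands.binary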

Sorry for bumping an old thread, but could you please share the demo you mentioned?

I have a similar use case (a very specific vocabulary with only 73 distinct words). I generated a text file containing all possible legal combinations of those words; it has around 200,000 lines and is 4.4 MB in size. I then generated the scorer package from that file.
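For context, the combinations file was generated along these lines (a simplified sketch; the word lists and the phrase pattern here are only illustrative, not my actual grammar):

# enumerate every phrase of the form "<piece> <file> takes <file> <rank>"
# and write one phrase per line into vocab.txt
for piece in queen king rook bishop knight pawn; do
  for src in a b c d e f g h; do
    for dst in a b c d e f g h; do
      for rank in one two three four five six seven eight; do
        echo "$piece $src takes $dst $rank"
      done
    done
  done
done > vocab.txt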

Then I used this scorer combined with the pre-trained v0.7.0 model.pbmm file to run the vad_transcriber example. However, I did not get good results; in fact, my transcription outputs are worse than they were when I used the original scorer.

@sigma_g, it sounds like you are generally doing the right thing; maybe there’s some detail you missed? In what way does the transcription fail? Was there anything systematic about the failures?

Thanks for your quick reply! For example, I said "queen a takes b four" but the output was "horse b to". I changed the value of --aggressive from 0 to 3 without success. When recorded without background noise (ceiling fan), it generated "rex b four".

I am recording on a 22 kHz headset microphone and downsampling to 16 kHz using sox. I speak about one word per second, and the audio is quite clear to me when I listen to the downsampled WAV file myself.
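The downsampling itself is a sox one-liner along these lines (file names are placeholders):

# resample the 22 kHz recording to 16 kHz, mono, 16-bit, which is what the model expects
sox recording_22k.wav -r 16000 -c 1 -b 16 recording_16k.wav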

I have also tried the mic vad streaming example, and it does not produce good transcription either.

Is there anything else that needs to be done?

PS: correction to my earlier remark: transcription is actually worse when using the pretrained v0.7.0 scorer (it generates some non-chess gibberish, which is kind of expected since it is a general English-language scorer :stuck_out_tongue: ).

How about you start a new thread and we’ll answer all your questions there? Since there are other people with similar ideas (maybe for Go :slight_smile: ), they could then find it all in one thread for 0.7.1.

Just start a new thread with what you’ve done up until now and I’ll have something to add :-)
