So here's my problem: I'm trying to create a personal healthcare assistant for healthcare providers. The assistant only recognizes a specific set of commands (23 in total). Since these are few in number, I want to overfit the DeepSpeech model on these sentences, so that I can get high accuracy on a small amount of data.
How I go about doing that: I have 6 samples of each command, each from a different speaker. My validation and test data are the training data itself (overfitting, right?). However, after continuing training for 3 epochs from a pretrained model, it ends up just predicting the letter h.
The following is the log after training the model:
```
Computing acoustic model predictions...
100% (46 of 46) |######################################################################################################| Elapsed Time: 0:01:39 Time: 0:01:39
Decoding predictions...
100% (46 of 46) |######################################################################################################| Elapsed Time: 0:01:16 Time: 0:01:16
Test - WER: 10.146552, CER: 3.705833, loss: 196.455765
--------------------------------------------------------------------------------
WER: 24.500000, CER: 96.000000, loss: 166.814606
- src: "next patient"
- res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h"
--------------------------------------------------------------------------------
WER: 22.000000, CER: 86.000000, loss: 123.011139
- src: "whos next"
- res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h "
--------------------------------------------------------------------------------
WER: 21.500000, CER: 85.000000, loss: 130.863281
- src: "next patient"
- res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h "
--------------------------------------------------------------------------------
WER: 20.500000, CER: 79.000000, loss: 116.693535
- src: "whos next"
- res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h"
--------------------------------------------------------------------------------
WER: 20.333333, CER: 120.000000, loss: 190.458618
- src: "my first patient"
- res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h "
--------------------------------------------------------------------------------
WER: 17.000000, CER: 98.000000, loss: 200.597824
- src: "how many appointments"
- res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h"
--------------------------------------------------------------------------------
WER: 15.500000, CER: 119.000000, loss: 207.080734
- src: "whos my first patient"
- res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h"
--------------------------------------------------------------------------------
WER: 14.666667, CER: 85.000000, loss: 176.761948
- src: "my first patient"
- res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h"
--------------------------------------------------------------------------------
WER: 14.000000, CER: 81.000000, loss: 129.585342
- src: "who is next"
- res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h "
--------------------------------------------------------------------------------
WER: 14.000000, CER: 81.000000, loss: 155.495468
- src: "who is next"
- res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h "
Can anyone help identify the problem, whether in the training steps or in the data quantity/quality? I'm also open to any alternative approaches to this specific requirement. Thanks.
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
Could you please start by making your console output properly formatted as code? It’s hard to distinguish what is output and what is your comment / question.
Here's the command I'm running. I've downloaded the compressed checkpoints file specified and extracted the contents into the "fine_tuning_checkpoints" directory.
I'm running the code on an Ubuntu machine with 125 GB of RAM:
```
/furqan/Projects$ uname -a
Linux DS0211 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
```
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
Here you state that you continued training for three epochs, but the command line you shared is for one extra epoch; that's inconsistent.
Having experimented with that, I got very good results by just building a language model made of those commands. Maybe you should give that a try? See the sketch below.
Why would you absolutely require re-training? Do those commands use domain-specific language that is unlikely to be properly recognized by default?
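For anyone who wants to try this route, building a command-only scorer might look roughly like the following. This is only a sketch assuming the v0.7-era data/lm tooling; the paths, the alpha/beta values, and commands.txt (one command per line, e.g. "next patient") are placeholders, not anything taken from this thread.

```
# Build a small KenLM language model from the command list.
python3 data/lm/generate_lm.py \
  --input_txt commands.txt \
  --output_dir . \
  --top_k 500 \
  --kenlm_bins /path/to/kenlm/build/bin/ \
  --arpa_order 3 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie
# NB: with a corpus this small, KenLM's lmplz may insist on its
# --discount_fallback option.

# Package lm.binary and the vocab file into a .scorer
# (v0.7.0 shipped this step as generate_package.py; later releases
# use the native generate_scorer_package tool with the same flags).
python3 data/lm/generate_package.py \
  --alphabet data/alphabet.txt \
  --lm lm.binary \
  --vocab vocab-500.txt \
  --package commands.scorer \
  --default_alpha 0.93 \
  --default_beta 1.18
```

The resulting commands.scorer is then passed at inference time in place of the general English scorer, which strongly biases decoding toward the 23 commands without touching the acoustic model.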
Thanks @reuben. Turns out I was specifying the checkpoint directory the wrong way: I was not passing the --checkpoint_dir argument along with the directory itself.
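For context, a correctly specified fine-tuning invocation would look something like the sketch below. This is illustrative only, assuming a v0.7-style DeepSpeech.py (older releases spelled some flags differently, e.g. --epoch instead of --epochs); the CSV names and learning rate are placeholders.

```
python3 DeepSpeech.py \
  --train_files train.csv \
  --dev_files dev.csv \
  --test_files test.csv \
  --checkpoint_dir fine_tuning_checkpoints/ \
  --epochs 3 \
  --learning_rate 0.0001 \
  --n_hidden 2048   # must match the released checkpoints
```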
@lissyx training from scratch on the data I have, or changing the language model, is not desirable, as the commands are still in the English language and the predictions are also part of the English language (if that makes sense).
I can report that the issue has been resolved: after training the model for 5 epochs it's now giving near-perfect results (still some kinks that need to be ironed out).
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
I'm not sure what you're saying here, but I can assure you that just making a specific language model on top of the English acoustic model works very well. So I don't get your point.
Sorry for bumping an old thread, but could you please share the demo you just mentioned?
I have a similar use case (a very specific vocabulary with only 73 distinct words). I generated a text file containing all possible legal combinations of those words; it had around 200,000 lines and was 4.4 MB in size. I generated the scorer package from these files.
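(For the curious, enumerating such a phrase list can be done with a few nested loops. The grammar below is a hypothetical, simplified chess-move grammar for illustration, not the actual one used here:)

```
# Hypothetical sketch: enumerate spoken chess-move phrases into vocab.txt.
for piece in king queen rook bishop knight pawn; do
  for file in a b c d e f g h; do
    for rank in one two three four five six seven eight; do
      echo "$piece to $file $rank"
      echo "$piece takes $file $rank"
    done
  done
done > vocab.txt
```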
Then I used this scorer combined with the pre-trained v0.7.0 model.pbmm file to run the vad_transcriber example. However, I did not get good results; in fact, my transcription outputs are worse than when I used the original scorer.
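(One sanity check worth noting: a custom scorer can also be exercised with the stock deepspeech CLI, which isolates scorer problems from the vad_transcriber pipeline. The file names below are placeholders:)

```
deepspeech \
  --model deepspeech-0.7.0-models.pbmm \
  --scorer custom.scorer \
  --audio utterance_16k.wav
```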
@sigma_g it sounds, generically, like you are doing the right thing, maybe there’s some detail you missed? In what way would the transcription fail? Was there anything systematic about the failures?
Thanks for your quick reply! For example, I said "queen a takes b four" but the output was "horse b to". I changed the value of --aggressive from 0 to 3 without success. When recorded without background noise (a ceiling fan), it generated "rex b four".
I am recording on a 22 kHz headset microphone and downsampling to 16 kHz using sox. I say one word per second, and it's quite clear to me when I listen to the downsampled wav file myself.
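(The downsampling step would typically be a one-liner like the following; the filenames are placeholders. 16 kHz, mono, 16-bit PCM is what the pretrained DeepSpeech models expect:)

```
sox recording_22k.wav -r 16000 -c 1 -b 16 recording_16k.wav
```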
I have also tried the mic vad streaming example, and it does not produce good transcription either.
Is there anything else that is needed to be done?
PS: a correction to my earlier remark: transcription is actually worse when using the pretrained v0.7.0 scorer (it generates some non-chess gibberish, which is kind of expected since it is a general English-language scorer).
How about you start a new thread and we'll answer all your questions there? As there are some people having similar ideas (maybe for Go), they could find it all in one thread for 0.7.1.
Just start a new thread with what you did up until now and I'll have something to add :-)