Not able to infer specific set of audio

Tejas_Shah · October 15, 2020, 7:00am

Hello,
I trained a model on Indian English with almost 650000 training samples and around 70K test samples. The following parameters were used while training the model.
python3 -u DeepSpeech.py --noshow_progressbar \
--automatic_mixed_precision True \
--train_files /home/ubuntu/DataSet/Data/TrainIndianEng.csv \
--train_batch_size 8 \
--dev_files /home/ubuntu/DataSet/Data/DevIndianEng.csv \
--dev_batch_size 8 \
--test_files /home/ubuntu/DataSet/testData/TestIndianEng.csv \
--test_batch_size 8 \
--max_to_keep 5 \
--checkpoint_dir /home/ubuntu/deepspeech/checkDir \
--learning_rate 0.0001 \
--epochs 20 \
--dropout_rate 0.20 \
--train_cudnn True \
--use_allow_growth True

The training above completes with a loss of around 13. The tests end at a median WER of 0.30.
Although inference works correctly on the test data set, if I try to infer another wav file of Indian English containing around 5-6 minutes of speech, the output is not complete. In fact, it outputs only a few words.
Here is the output for that file.

TensorFlow: v2.2.0-24-g1c1b2b9
DeepSpeech: v0.8.2-0-g02e4c76
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2020-10-15 06:44:12.618432: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 1.53s.
Running inference.
> wark ter vts cuta how an ee i o a
Inference took 292.311s for 343.620s audio file.

If I infer the same wav file with a US English model which was built with LibriSpeech data, it is able to infer the following.

TensorFlow: v2.2.0-24-g1c1b2b9
DeepSpeech: v0.8.2-0-g02e4c76
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2020-10-15 06:50:32.907259: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 1.49s.
Running inference.
> pr yo bo on migi te lon lettit at cand tack ther hok an i har bereand i din’t od thee bi hu banier opla bort gu il o alin a gol hluoked a bea back to me ty an’to beok dor shoulpangeni kit thim omen hell ah do il ten mymkan abit e of to contact numberr man an do te opi tine it clolitle bye no it nine nine to the opi ocan i nlang do dittle qui b wenined bortho wiel the nine pord hied ton the don de temomen that mymly no man rubing angi and you ae i col a dyu beal hill were yer to tonl ment i me i ia he conon hold pam minute nen dod on the ank you in enoo a a ao a aahaa a aaao oh ao iah i ket ot tink on tigh came up ahd ma tayk hume pie mi’ bare yin ol heu in doing a bo herkin a mong then etdo ti wout gon a ho with o lod you may thinkin than ik you ai then w im a ame be ma po ii aa a aaaah a ioohho iaaaahaah a iaoohal bonblaooaaahha ha io ohapo aaa a ant o m a hahen waa eam behol heha lao lo i u abat a lom at it u ethat omaat ecanilt mun e bha i apon a gat en o ong hanno yet mot ah shebandinie woul not be bushe dege te caldigt not cand and interco bac with tan happinar the youre ho wone me n en olr tang i do on har han dwelver no he lar gone you my you a bhe get back to me at do wete mi won obed them the un thegod a hen that by wart you gould be on the poo them let it irn thad bol in lack bove beco do ot am brobema and he have e it yo logt to gen tack to he has bo wide any at the adban by and men it martete on of get the crited din lacko di her are ight marm oh under ten bud he wh in lordy licatie pold ado know gar you not id levlen you knongoun the bold to vilivin ah maeor hap he provigion turchen the deter inn is la yeu to harbauck the tan ort and our helilad orere for to inte were heu oh den flive o’clod or car ith an him u lo a ther an you orer we my by number i had ut on he mudled be te cod then unber in that the hillinof ant oud ten o to you yeuv be the the onn nor be a bego then the goi lot my you tae don min un lork ogornly go bev er than thof te themer bat 
my un word od epear oi ye ye el can have you numbero anby b yor night at no do ork to lyny drew teto thau belon dola ve youd tem ete noto leven thick in’tn there notle ola in an the goden amon again emp on nine your d dot no niabe war my la do ne or noboe london heven becleat portect i go do now be o wit yow acan a giv aco batmit before by book loco care he mantei ah my ning it hadh lord let i atelot be i ar eago bang lok my my hammam pank you thank it le hodetn ir oren
Inference took 292.779s for 343.620s audio file.

The output of the LibriSpeech model is not correct, but at least it produces a lot of text, which is what I expect for 5-6 minutes of audio.
I believe the audio file itself is fine, as the LibriSpeech example shows. However, I don't understand the reason for the bad output on Indian English voice samples.
Can someone help me debug this further? The Indian English model cannot be faulty, since it is able to infer the test data. The audio file cannot be faulty, since the US English model was able to infer a lot of text.
Is there a way to debug the situation?

othiele · October 15, 2020, 8:23am

Run inference without a scorer to see the raw output.
It is unclear what you used for dev; this should be high-quality material.
Use a higher dropout of 0.3 or 0.4.
You may be overfitting; 20 epochs should give you reasonably good results for good input material.
Use a higher batch size; it only makes training faster, not better.

Tejas_Shah · October 15, 2020, 9:52am

Hi Olaf, thanks for the reply.
The inference shown in my earlier message was run without a scorer. I tried using a scorer, but it does not improve the situation.
The dev dataset comes from the same dataset used for training and testing. Attached is the output of the training process: model_build_IndianEng_libri.pdf (33.8 KB). I used 20 epochs for training.

othiele · October 15, 2020, 3:18pm

Again:

What do you use for dev and how much?
Dropout is low depending on the material, what do you use?

Tejas_Shah · October 15, 2020, 3:58pm

Here are the datasets.
Train Dataset: 650K samples
Dev Dataset: 60K Samples
Test Dataset: 70K Samples

Dropout used is 0.2.

othiele · October 15, 2020, 6:45pm

Not too bad. Check that the dev set does not contain many problematic chunks and use a higher dropout. I would go with 0.3 or higher.

As for the WER I am not sure whether you should use that without a scorer as you would need the exact words. Your CER looks good. You could continue training and see whether it improves or you retrain with the higher dropout.

Tejas_Shah · October 19, 2020, 11:21am

Thanks @othiele.
Based on your suggestions, I restarted the training from scratch with a dropout rate of 0.4.
I trained the model in two phases: the first ran for 5 epochs and the second for 10 epochs. Surprisingly, the results after pass 1 are much better than the results after pass 2. I am not sure why the inference results get worse with more training.
Here is the output of each pass:
Pass1:
I Training epoch 0…
I Finished training epoch 0 - loss: 31.696602
I Validating epoch 0 on /home/ubuntu/DataSet/DevIndianEng.csv…
I Finished validating epoch 0 on /home/ubuntu/DataSet/DevIndianEng.csv - loss: 29.279222

I Training epoch 1…
I Finished training epoch 1 - loss: 25.797112
I Validating epoch 1 on /home/ubuntu/DataSet/DevIndianEng.csv…
I Finished validating epoch 1 on /home/ubuntu/DataSet/DevIndianEng.csv - loss: 26.403398
I Saved new best validating model with loss 26.403398

I Training epoch 2…
I Finished training epoch 2 - loss: 23.697474
I Validating epoch 2 on /home/ubuntu/DataSet/DevIndianEng.csv…
I Finished validating epoch 2 on /home/ubuntu/DataSet/DevIndianEng.csv - loss: 23.351679
I Saved new best validating model with loss 23.351679

I Training epoch 3…
I Finished training epoch 3 - loss: 22.394458
I Validating epoch 3 on /home/ubuntu/DataSet/DevIndianEng.csv…
I Finished validating epoch 3 on /home/ubuntu/DataSet/DevIndianEng.csv - loss: 22.789643
I Saved new best validating model with loss 22.789643

I Training epoch 4…
I Finished training epoch 4 - loss: 21.499197
I Validating epoch 4 on /home/ubuntu/DataSet/DevIndianEng.csv…
I Finished validating epoch 4 on /home/ubuntu/DataSet/DevIndianEng.csv - loss: 21.611039
I Saved new best validating model with loss 21.611039

Pass2 ended with loss of 17.92…

Do you think something is wrong in the Dev dataset? How do I make sure?

othiele · October 19, 2020, 11:38am

It is hard to tell as you don’t give the data for the 2nd run. I would suggest you put the train/dev data into a sheet and plot the curves. This would give us more info on overfitting.

Listen to lots of samples and try to find out what your dataset is like.
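For getting the train/dev losses into a sheet, a small parser over the training log may save some copy-pasting. The following is only a sketch: it assumes the log lines look like the "Finished training epoch" and "Finished validating epoch" lines quoted earlier in this thread, and the names `parse_losses` and `to_csv` are made up for illustration.

```python
import csv
import io
import re

# Patterns follow the log lines quoted earlier in this thread (an assumption
# about the log format; adjust them if your DeepSpeech version prints differently).
TRAIN_RE = re.compile(r"Finished training epoch (\d+) - loss: ([\d.]+)")
DEV_RE = re.compile(r"Finished validating epoch (\d+) on .* - loss: ([\d.]+)")


def parse_losses(log_text):
    """Return (epoch, train_loss, dev_loss) tuples, sorted by epoch."""
    train, dev = {}, {}
    for line in log_text.splitlines():
        match = TRAIN_RE.search(line)
        if match:
            train[int(match.group(1))] = float(match.group(2))
            continue
        match = DEV_RE.search(line)
        if match:
            dev[int(match.group(1))] = float(match.group(2))
    # Keep only epochs for which both losses were logged.
    return [(e, train[e], dev[e]) for e in sorted(train) if e in dev]


def to_csv(rows):
    """Render the rows as CSV text that a spreadsheet can import."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["epoch", "train_loss", "dev_loss"])
    writer.writerows(rows)
    return buf.getvalue()


SAMPLE_LOG = """\
I Training epoch 0...
I Finished training epoch 0 - loss: 31.696602
I Validating epoch 0 on /home/ubuntu/DataSet/DevIndianEng.csv...
I Finished validating epoch 0 on /home/ubuntu/DataSet/DevIndianEng.csv - loss: 29.279222
I Training epoch 1...
I Finished training epoch 1 - loss: 25.797112
I Finished validating epoch 1 on /home/ubuntu/DataSet/DevIndianEng.csv - loss: 26.403398
"""

if __name__ == "__main__":
    print(to_csv(parse_losses(SAMPLE_LOG)))
```

The CSV text can be pasted into any spreadsheet to plot both curves against the epoch number.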

Tejas_Shah · October 19, 2020, 1:13pm

Here is the graph and the loss data for 15 epochs that I ran.

Epoch	Training	Dev
1	31.696602	29.279222
2	25.797112	26.403398
3	23.697474	23.351679
4	22.394458	22.789643
5	21.499197	21.611039
6	20.802281	20.225757
7	20.255105	20.011726
8	19.822383	19.445061
9	19.45445	18.534895
10	19.162925	18.120503
11	18.875543	17.938794
12	18.677697	17.756683
13	18.491033	17.55612
14	18.328995	17.460174
15	18.157414	17.92518

[Graph: Training Loss]

Hope this helps.

othiele · October 19, 2020, 2:13pm

Yep, this doesn’t look too bad. You could try how the model at epoch 9/10 compares as the dev doesn’t decrease much after that.

@lissyx and everybody else, what do you think?
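To decide which checkpoint to compare, it may help to look at where the dev loss stops improving. The sketch below uses the numbers from the table posted above; the helper names `best_dev_epoch` and `dev_deltas` are illustrative and not part of DeepSpeech.

```python
# (epoch, training loss, dev loss), copied from the table posted above.
LOSSES = [
    (1, 31.696602, 29.279222),
    (2, 25.797112, 26.403398),
    (3, 23.697474, 23.351679),
    (4, 22.394458, 22.789643),
    (5, 21.499197, 21.611039),
    (6, 20.802281, 20.225757),
    (7, 20.255105, 20.011726),
    (8, 19.822383, 19.445061),
    (9, 19.45445, 18.534895),
    (10, 19.162925, 18.120503),
    (11, 18.875543, 17.938794),
    (12, 18.677697, 17.756683),
    (13, 18.491033, 17.55612),
    (14, 18.328995, 17.460174),
    (15, 18.157414, 17.92518),
]


def best_dev_epoch(rows):
    """Epoch with the lowest dev loss."""
    return min(rows, key=lambda row: row[2])[0]


def dev_deltas(rows):
    """(epoch, change in dev loss versus the previous epoch); positive means worse."""
    return [(cur[0], cur[2] - prev[2]) for prev, cur in zip(rows, rows[1:])]


if __name__ == "__main__":
    print("best dev epoch:", best_dev_epoch(LOSSES))
    for epoch, delta in dev_deltas(LOSSES):
        marker = "  <- dev loss went up" if delta > 0 else ""
        print(f"epoch {epoch:2d}: {delta:+.6f}{marker}")
```

On this table the dev loss is lowest at epoch 14 and rises only at epoch 15, while the training loss keeps falling throughout.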

lissyx · October 19, 2020, 2:22pm

“samples” is not an actionable measure

loss value is meaningless, only its evolution wrt dev/train makes sense

trained from scratch ? transfer-learning ?

define faulty. what test data do you refer to? you only show one example above.

what scorer? our english one? one you built?

Could you please use appropriate format for sharing that? Plain text is much better than … PDF.

lissyx · October 19, 2020, 2:22pm

Your training starts from a low loss compared to what you achieve, so my guess is that your model is not learning a lot. But there is a lot of missing context, as replied previously.

Tejas_Shah · October 19, 2020, 4:19pm

Hi,
I did not understand all of your questions. Here is my reply to the ones I could understand.
trained from scratch ? transfer-learning ?
I used a checkpoint obtained from training on LibriSpeech data.
Define Faulty?
Testing gives the following output, where we can see that the model is able to infer to a good extent.

Test on /home/ubuntu/DataSet/tariniTest3.csv - WER: 0.306499, CER: 0.133683, loss: inf

Best WER:

WER: 0.000000, CER: 0.000000, loss: 14.413833

wav: file:///home/ubuntu/DataSet/EnglishB743951-F-21_290.wav

src: “the door will not open the lock must be out of order”

res: “the door will not open the lock must be out of order”

WER: 0.000000, CER: 0.000000, loss: 12.270652

wav: file:///home/ubuntu/DataSet/EnglishB741866-F-19_392.wav

src: “seventy six thousand nine hundred seventy eight”

res: “seventy six thousand nine hundred seventy eight”

WER: 0.000000, CER: 0.029412, loss: 11.592957

wav: file:///home/ubuntu/DataSet/EnglishB741827-M-26_321.wav

src: "what is the pressure in st albert "

res: “what is the pressure in st albert”

WER: 0.000000, CER: 0.000000, loss: 11.370380

wav: file:///home/ubuntu/DataSet/EnglishB743978-M-26_324.wav

src: “i don’t want to listen to slow songs by bogdan bacanu”

res: “i don’t want to listen to slow songs by bogdan bacanu”

WER: 0.000000, CER: 0.000000, loss: 11.101262

wav: file:///home/ubuntu/DataSet/EnglishB743972-M-21_381.wav

src: “one three seven eight five five zero eight one zero”

res: “one three seven eight five five zero eight one zero”

Median WER:

WER: 0.272727, CER: 0.096154, loss: 23.275650

wav: file:///home/ubuntu/DataSet/EnglishB745900-M-18_210.wav

src: “but the manure contents may not be what a crop needs”

res: “but the manu contents may not be what a croppleds”

WER: 0.272727, CER: 0.176471, loss: 23.255077

wav: file:///home/ubuntu/DataSet/EnglishB745900-M-18_93.wav

src: “it was a privilege that left him with much to prove”

res: “it was a previlated that left him we must to prove”

WER: 0.272727, CER: 0.075472, loss: 23.253437

wav: file:///home/ubuntu/DataSet/EnglishB745871-M-20_124.wav

src: “hire your team and proceed to choose what’s in or out”

res: “ir your team and proceed to choose what’s in ornot”

WER: 0.272727, CER: 0.101695, loss: 23.253134

wav: file:///home/ubuntu/DataSet/EnglishB743954-F-19_127.wav

src: “i beat dick in billiards because he gave me tremendous odds”

res: “i beat taken billiards because he gaves me tremendous odds”

WER: 0.272727, CER: 0.107143, loss: 23.244072

wav: file:///home/ubuntu/DataSet/EnglishB745134-M-18_041.wav

src: “once fuel cells have been perfected we could all own one”

res: “once fuel says have been perfected be could all on one”

Worst WER:

WER: 2.200000, CER: 2.050000, loss: 158.543167

wav: file:///home/ubuntu/DataSet/EnglishB742929-F-19_176.wav

src: “he’s a child like me”

res: “heis the child like me and hes sick with fear at loss i her”

WER: 2.250000, CER: 2.714286, loss: 191.569550

wav: file:///home/ubuntu/DataSet/EnglishB742930-F-19_002.wav

src: "oh by the way "

res: “hope by the way jeff did you go to grees on purpose”

WER: 2.250000, CER: 1.565217, loss: 186.429092

wav: file:///home/ubuntu/DataSet/EnglishB741843-F-22_097.wav

src: "a stranger for example "

res: “stranger for example mike cup in front of his in frafic”

WER: 2.250000, CER: 1.652174, loss: 157.486099

wav: file:///home/ubuntu/DataSet/EnglishB742890-M-22_164.wav

src: "his object was however "

res: “he’s object was however to be wetorious and not to win money”

WER: 2.285714, CER: 1.638889, loss: 321.967621

wav: file:///home/ubuntu/DataSet/EnglishB743011-F-24_363.wav

src: “remind me to do laundry every sunday”

res: "remind me to do loun nal ha as regon xlesoe a a e or e e e a a a "

What scorer?
I tried with a scorer that I built myself.
Attaching PDF?
Unfortunately, the site does not let me attach and upload a txt file, and I preferred uploading a file to pasting a long text.
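For readers checking the report above by hand: the WER and CER figures appear consistent with a plain Levenshtein edit distance over words and characters, divided by the reference length. The following is a minimal sketch under that assumption, not DeepSpeech's own evaluation code; the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def wer(src, res):
    """Word error rate: word-level edits divided by the reference word count."""
    ref_words = src.split()
    return edit_distance(ref_words, res.split()) / len(ref_words)


def cer(src, res):
    """Character error rate: character-level edits divided by the reference length."""
    return edit_distance(src, res) / len(src)


if __name__ == "__main__":
    # First "Median WER" entry from the report above.
    src = "but the manure contents may not be what a crop needs"
    res = "but the manu contents may not be what a croppleds"
    print(f"WER: {wer(src, res):.6f}, CER: {cer(src, res):.6f}")
```

Because the hypothesis can contain more words than the reference, the ratio can exceed 1, which is how the "Worst WER" entries reach values above 2. Both functions divide by the reference length, so an empty reference would need separate handling.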

othiele · October 19, 2020, 5:29pm

Ah, you are transferring from another model, you could have said that earlier. Your learning rate is too high. Search here for transfer learning. And try training from scratch without another model, you might get better results.
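One way to act on the learning-rate remark is to prepare a few runs that differ only in `--learning_rate` and compare their dev losses. The sketch below only builds and prints the command lines. The candidate rates, the checkpoint paths and the epoch count are hypothetical placeholders, and it assumes the installed DeepSpeech version accepts `--load_checkpoint_dir` and `--save_checkpoint_dir`, so check the flags of your version before running anything.

```python
import shlex

# Flags and dataset paths taken from the training command at the top of this thread.
BASE_ARGS = [
    "python3", "-u", "DeepSpeech.py",
    "--train_files", "/home/ubuntu/DataSet/Data/TrainIndianEng.csv",
    "--dev_files", "/home/ubuntu/DataSet/Data/DevIndianEng.csv",
    "--train_batch_size", "8",
    "--dev_batch_size", "8",
    "--dropout_rate", "0.4",
    "--train_cudnn", "True",
]


def build_command(learning_rate, epochs, source_ckpt, out_root):
    """Build one fine-tuning command that writes to its own checkpoint directory."""
    rate = f"{learning_rate:g}"
    return BASE_ARGS + [
        "--load_checkpoint_dir", source_ckpt,               # assumed flag, see lead-in
        "--save_checkpoint_dir", f"{out_root}/lr_{rate}",   # assumed flag, see lead-in
        "--learning_rate", rate,
        "--epochs", str(epochs),
    ]


if __name__ == "__main__":
    # Hypothetical candidates below the 0.0001 used so far; not recommended values.
    for lr in (0.00005, 0.00001):
        cmd = build_command(lr, 3, "/path/to/librispeech/checkpoints", "/path/to/runs")
        print(shlex.join(cmd))
```

Keeping a separate save directory per run leaves the source checkpoint untouched, so every candidate starts from the same weights.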

lissyx · October 19, 2020, 5:51pm

From your testing output, it seems you just fall into the set of “worst WER”. Your file seems to have characteristics that make it hard to infer. I have no idea why and I have no time to investigate.

Tejas_Shah · November 19, 2020, 8:23am

Hello,
I also tried transfer learning from the LibriSpeech model. However, the result is still the same. As the number of epochs increases, the inference output on another set of data becomes worse.

othiele · November 19, 2020, 9:32am

You didn’t follow up on lissyx’s comment that you simply have very strange material. Judging from the amount you have, you should get somewhat good results.

How long is your material max, min, median? What is it like?

What learning rate did you use? How did you build your scorer? Why not transfer from the released English model?
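The max, min and median clip lengths asked about above can be read from the WAV headers. This is a rough sketch that assumes the usual DeepSpeech CSV layout with a `wav_filename` column holding paths that can be opened as-is, and PCM WAV files that Python's `wave` module can read; the function names are illustrative.

```python
import csv
import statistics
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file, from its header (frames / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def duration_stats(csv_path):
    """Count, min, max and median duration of all clips listed in a dataset CSV."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        durations = [wav_duration_seconds(row["wav_filename"])
                     for row in csv.DictReader(f)]
    return {
        "count": len(durations),
        "min": min(durations),
        "max": max(durations),
        "median": statistics.median(durations),
    }


if __name__ == "__main__":
    import sys
    # Usage: python3 duration_stats.py /path/to/dataset.csv
    if len(sys.argv) > 1:
        print(duration_stats(sys.argv[1]))
```

Comparing these numbers with the length of the file that fails (343.620s in the first post) shows how far it lies from the clips the model was trained on.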

Tejas_Shah · November 19, 2020, 10:19am

If I use a scorer (either one I built from LibriSpeech data or the official one, deepspeech-0.8.2-models.scorer), the result is worse than without it.
I used a learning rate of 0.0001.

othiele · November 19, 2020, 10:28am

Please start using the search function in this forum and check other learning rates for transfer learning or fine tuning. Same for a custom scorer.
