Not able to infer specific set of audio

Hello,
I trained a model on Indian English with almost 650000 training samples and around 70K test samples. The following parameters were used while training the model.
python3 -u DeepSpeech.py --noshow_progressbar \
--automatic_mixed_precision True \
--train_files /home/ubuntu/DataSet/Data/TrainIndianEng.csv \
--train_batch_size 8 \
--dev_files /home/ubuntu/DataSet/Data/DevIndianEng.csv \
--dev_batch_size 8 \
--test_files /home/ubuntu/DataSet/testData/TestIndianEng.csv \
--test_batch_size 8 \
--max_to_keep 5 \
--checkpoint_dir /home/ubuntu/deepspeech/checkDir \
--learning_rate 0.0001 \
--epochs 20 \
--dropout_rate 0.20 \
--train_cudnn True \
--use_allow_growth True

The training above completes with a loss of around 13. The tests end at a median of 0.30 WER.
Although inference works correctly on the test dataset, when I try to infer another Indian English wav file with around 5-6 minutes of speech, the output is not complete. In fact, it just outputs around 1-3 words.
The output is as follows.

TensorFlow: v2.2.0-24-g1c1b2b9
DeepSpeech: v0.8.2-0-g02e4c76
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2020-10-15 06:44:12.618432: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 1.53s.
Running inference.
> wark ter vts cuta how an ee i o a
Inference took 292.311s for 343.620s audio file.

If I infer the same wav file with a US English model which was built with LibriSpeech data, it is able to infer the following.

TensorFlow: v2.2.0-24-g1c1b2b9
DeepSpeech: v0.8.2-0-g02e4c76
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2020-10-15 06:50:32.907259: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 1.49s.
Running inference.
> pr yo bo on migi te lon lettit at cand tack ther hok an i har bereand i din’t od thee bi hu banier opla bort gu il o alin a gol hluoked a bea back to me ty an’to beok dor shoulpangeni kit thim omen hell ah do il ten mymkan abit e of to contact numberr man an do te opi tine it clolitle bye no it nine nine to the opi ocan i nlang do dittle qui b wenined bortho wiel the nine pord hied ton the don de temomen that mymly no man rubing angi and you ae i col a dyu beal hill were yer to tonl ment i me i ia he conon hold pam minute nen dod on the ank you in enoo a a ao a aahaa a aaao oh ao iah i ket ot tink on tigh came up ahd ma tayk hume pie mi’ bare yin ol heu in doing a bo herkin a mong then etdo ti wout gon a ho with o lod you may thinkin than ik you ai then w im a ame be ma po ii aa a aaaah a ioohho iaaaahaah a iaoohal bonblaooaaahha ha io ohapo aaa a ant o m a hahen waa eam behol heha lao lo i u abat a lom at it u ethat omaat ecanilt mun e bha i apon a gat en o ong hanno yet mot ah shebandinie woul not be bushe dege te caldigt not cand and interco bac with tan happinar the youre ho wone me n en olr tang i do on har han dwelver no he lar gone you my you a bhe get back to me at do wete mi won obed them the un thegod a hen that by wart you gould be on the poo them let it irn thad bol in lack bove beco do ot am brobema and he have e it yo logt to gen tack to he has bo wide any at the adban by and men it martete on of get the crited din lacko di her are ight marm oh under ten bud he wh in lordy licatie pold ado know gar you not id levlen you knongoun the bold to vilivin ah maeor hap he provigion turchen the deter inn is la yeu to harbauck the tan ort and our helilad orere for to inte were heu oh den flive o’clod or car ith an him u lo a ther an you orer we my by number i had ut on he mudled be te cod then unber in that the hillinof ant oud ten o to you yeuv be the the onn nor be a bego then the goi lot my you tae don min un lork ogornly go bev er than thof te themer bat 
my un word od epear oi ye ye el can have you numbero anby b yor night at no do ork to lyny drew teto thau belon dola ve youd tem ete noto leven thick in’tn there notle ola in an the goden amon again emp on nine your d dot no niabe war my la do ne or noboe london heven becleat portect i go do now be o wit yow acan a giv aco batmit before by book loco care he mantei ah my ning it hadh lord let i atelot be i ar eago bang lok my my hammam pank you thank it le hodetn ir oren
Inference took 292.779s for 343.620s audio file.

The output of the LibriSpeech model is not correct, but at least it produced a lot of text, which is what I expect from 5-6 minutes of audio.
I believe the audio file itself is fine, as the LibriSpeech example shows. However, I don't understand why the model trained on Indian English voice samples gives such poor output.
Can someone help me debug this further? The Indian English model cannot be faulty, as it is able to infer the test data. The audio file cannot be faulty, as the US English model was able to infer a lot of text.
Is there a way to debug the situation?
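One low-cost check for a file like this is to compare its format with what the model was trained on. The sketch below uses only Python's standard `wave` module. The expected values (16 kHz, mono, 16-bit PCM) are an assumption based on the released DeepSpeech English models, and `describe_wav` and `format_mismatches` are hypothetical helper names, not part of DeepSpeech.

```python
import wave

# Assumed target format (not taken from this thread): 16 kHz, mono, 16-bit PCM.
EXPECTED = {"sample_rate": 16000, "channels": 1, "sample_width_bytes": 2}


def describe_wav(path):
    """Return the basic format properties and duration of a PCM wav file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / float(rate),
        }


def format_mismatches(info, expected=EXPECTED):
    """Map each differing property to a (found, expected) pair."""
    return {key: (info[key], want)
            for key, want in expected.items()
            if info[key] != want}
```

Running `describe_wav` on the long recording and on a few clips from the training CSV shows whether sample rate, channel count, sample width or duration differ between them.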

  • Run inference without the scorer to see the raw output
  • It is unclear what you used for dev; this should be good-quality material
  • Use a higher dropout of 0.3 or 0.4
  • You may be overfitting; 20 epochs should give you reasonably good results for good input material
  • Use a higher batch size, which just makes training faster, not better

Hi Olaf, thanks for the reply.
The inference shown in my earlier message was run without a scorer. I tried using a scorer, but it does not improve the situation.
The dev dataset also comes from the same dataset used for training and testing. Attached is the output of the training process: model_build_IndianEng_libri.pdf (33.8 KB). I used 20 epochs for training.

Again:

  • What do you use for dev and how much?
  • Dropout is low depending on the material, what do you use?

Here are the datasets.
Train Dataset: 650K samples
Dev Dataset: 60K Samples
Test Dataset: 70K Samples

Dropout used is 0.2.

Not too bad. Check that the dev set does not contain many problematic chunks and use a higher dropout. I would go with 0.3 or higher.

As for the WER, I am not sure whether you should rely on it without a scorer, as you would need the exact words. Your CER looks good. You could continue training and see whether it improves, or retrain with the higher dropout.

Thanks @othiele.
Based on your suggestions, I restarted the training from scratch with a dropout rate of 0.4.
I trained the model in two phases: in the first phase I trained for 5 epochs, and the second ran for 10 epochs. Surprisingly, the results after pass 1 are much better than the results after pass 2. I am not sure why the inference results get worse with more training.
Here is the output of each pass:
Pass1:
I Training epoch 0…
I Finished training epoch 0 - loss: 31.696602
I Validating epoch 0 on /home/ubuntu/DataSet/DevIndianEng.csv…
I Finished validating epoch 0 on /home/ubuntu/DataSet/DevIndianEng.csv - loss: 29.279222

I Training epoch 1…
I Finished training epoch 1 - loss: 25.797112
I Validating epoch 1 on /home/ubuntu/DataSet/DevIndianEng.csv…
I Finished validating epoch 1 on /home/ubuntu/DataSet/DevIndianEng.csv - loss: 26.403398
I Saved new best validating model with loss 26.403398

I Training epoch 2…
I Finished training epoch 2 - loss: 23.697474
I Validating epoch 2 on /home/ubuntu/DataSet/DevIndianEng.csv…
I Finished validating epoch 2 on /home/ubuntu/DataSet/DevIndianEng.csv - loss: 23.351679
I Saved new best validating model with loss 23.351679

I Training epoch 3…
I Finished training epoch 3 - loss: 22.394458
I Validating epoch 3 on /home/ubuntu/DataSet/DevIndianEng.csv…
I Finished validating epoch 3 on /home/ubuntu/DataSet/DevIndianEng.csv - loss: 22.789643
I Saved new best validating model with loss 22.789643

I Training epoch 4…
I Finished training epoch 4 - loss: 21.499197
I Validating epoch 4 on /home/ubuntu/DataSet/DevIndianEng.csv…
I Finished validating epoch 4 on /home/ubuntu/DataSet/DevIndianEng.csv - loss: 21.611039
I Saved new best validating model with loss 21.611039

Pass2 ended with loss of 17.92…

Do you think something is wrong in the Dev dataset? How do I make sure?

It is hard to tell as you don’t give the data for the 2nd run. I would suggest you put the train/dev data into a sheet and plot the curves. This would give us more info on overfitting.

Listen to lots of samples and try to find out what your dataset is like.
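To get the train/dev numbers into a sheet, the per-epoch losses can be pulled out of the training log. This is a minimal sketch that assumes the log lines look like the "Finished training epoch" and "Finished validating epoch" lines pasted above; `parse_losses` is a hypothetical helper, not a DeepSpeech function.

```python
import re

# Patterns modelled on the log lines quoted earlier in this thread.
TRAIN_RE = re.compile(r"Finished training epoch (\d+) - loss: ([\d.]+)")
DEV_RE = re.compile(r"Finished validating epoch (\d+) on .* - loss: ([\d.]+)")


def parse_losses(log_text):
    """Return (epoch, train_loss, dev_loss) tuples for epochs that have both."""
    train, dev = {}, {}
    for line in log_text.splitlines():
        match = TRAIN_RE.search(line)
        if match:
            train[int(match.group(1))] = float(match.group(2))
            continue
        match = DEV_RE.search(line)
        if match:
            dev[int(match.group(1))] = float(match.group(2))
    return [(epoch, train[epoch], dev[epoch])
            for epoch in sorted(train) if epoch in dev]


def to_tsv(rows):
    """Format the rows as tab-separated text that can be pasted into a sheet."""
    lines = ["epoch\ttrain\tdev"]
    lines += ["%d\t%.6f\t%.6f" % row for row in rows]
    return "\n".join(lines)
```

Plotting the two columns against the epoch number shows whether the dev loss flattens or turns upward while the training loss keeps falling.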

Here is the graph and the loss data for 15 epochs that I ran.

Epoch Training Dev
1 31.696602 29.279222
2 25.797112 26.403398
3 23.697474 23.351679
4 22.394458 22.789643
5 21.499197 21.611039
6 20.802281 20.225757
7 20.255105 20.011726
8 19.822383 19.445061
9 19.45445 18.534895
10 19.162925 18.120503
11 18.875543 17.938794
12 18.677697 17.756683
13 18.491033 17.55612
14 18.328995 17.460174
15 18.157414 17.92518

[Graph: Training Loss]

Hope this helps.

Yep, this doesn’t look too bad. You could try how the model at epoch 9/10 compares as the dev doesn’t decrease much after that.

@lissyx and everybody else, what do you think?
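The epoch 9/10 suggestion can be checked against the posted table with a few lines of Python. The sketch below copies the dev column from the table (epochs numbered from 1, as in the table); `best_dev_epoch` and `relative_gain` are hypothetical helper names.

```python
# Dev losses for epochs 1..15, copied from the table above.
DEV_LOSS = [29.279222, 26.403398, 23.351679, 22.789643, 21.611039,
            20.225757, 20.011726, 19.445061, 18.534895, 18.120503,
            17.938794, 17.756683, 17.55612, 17.460174, 17.92518]


def best_dev_epoch(dev_losses):
    """Return (epoch, loss) for the lowest dev loss; epochs are 1-based."""
    index = min(range(len(dev_losses)), key=dev_losses.__getitem__)
    return index + 1, dev_losses[index]


def relative_gain(dev_losses, from_epoch, to_epoch):
    """Fractional drop in dev loss between two 1-based epochs."""
    start = dev_losses[from_epoch - 1]
    end = dev_losses[to_epoch - 1]
    return (start - end) / start


epoch, loss = best_dev_epoch(DEV_LOSS)
print("best dev loss %.6f at epoch %d" % (loss, epoch))
print("gain from epoch 10 to %d: %.1f%%"
      % (epoch, 100 * relative_gain(DEV_LOSS, 10, epoch)))
```

On these numbers the lowest dev loss is at epoch 14, the dev loss rises again at epoch 15, and the improvement between epoch 10 and epoch 14 is under 4%.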

“samples” is not an actionable measure

loss value is meaningless, only its evolution wrt dev/train makes sense

trained from scratch ? transfer-learning ?

define faulty. what test data do you refer to? you only show one example above.

what scorer? our english one? one you built?

Could you please use appropriate format for sharing that? Plain text is much better than … PDF.

Your training starts from a low loss compared to where it ends up, so my guess is that your model is not learning a lot. But there is a lot of missing context, as replied previously.

Hi,
I did not understand all of your questions, but here are my replies to the ones I could follow.
trained from scratch ? transfer-learning ?
I started from a checkpoint obtained by training on LibriSpeech data.
Define Faulty?
Testing gives the following output, which shows that the model is able to infer the test data reasonably well.

Test on /home/ubuntu/DataSet/tariniTest3.csv - WER: 0.306499, CER: 0.133683, loss: inf

Best WER:

WER: 0.000000, CER: 0.000000, loss: 14.413833

  • wav: file:///home/ubuntu/DataSet/EnglishB743951-F-21_290.wav
  • src: “the door will not open the lock must be out of order”
  • res: “the door will not open the lock must be out of order”

WER: 0.000000, CER: 0.000000, loss: 12.270652

  • wav: file:///home/ubuntu/DataSet/EnglishB741866-F-19_392.wav
  • src: “seventy six thousand nine hundred seventy eight”
  • res: “seventy six thousand nine hundred seventy eight”

WER: 0.000000, CER: 0.029412, loss: 11.592957

  • wav: file:///home/ubuntu/DataSet/EnglishB741827-M-26_321.wav
  • src: "what is the pressure in st albert "
  • res: “what is the pressure in st albert”

WER: 0.000000, CER: 0.000000, loss: 11.370380

  • wav: file:///home/ubuntu/DataSet/EnglishB743978-M-26_324.wav
  • src: “i don’t want to listen to slow songs by bogdan bacanu”
  • res: “i don’t want to listen to slow songs by bogdan bacanu”

WER: 0.000000, CER: 0.000000, loss: 11.101262

  • wav: file:///home/ubuntu/DataSet/EnglishB743972-M-21_381.wav
  • src: “one three seven eight five five zero eight one zero”
  • res: “one three seven eight five five zero eight one zero”

Median WER:

WER: 0.272727, CER: 0.096154, loss: 23.275650

  • wav: file:///home/ubuntu/DataSet/EnglishB745900-M-18_210.wav
  • src: “but the manure contents may not be what a crop needs”
  • res: “but the manu contents may not be what a croppleds”

WER: 0.272727, CER: 0.176471, loss: 23.255077

  • wav: file:///home/ubuntu/DataSet/EnglishB745900-M-18_93.wav
  • src: “it was a privilege that left him with much to prove”
  • res: “it was a previlated that left him we must to prove”

WER: 0.272727, CER: 0.075472, loss: 23.253437

  • wav: file:///home/ubuntu/DataSet/EnglishB745871-M-20_124.wav
  • src: “hire your team and proceed to choose what’s in or out”
  • res: “ir your team and proceed to choose what’s in ornot”

WER: 0.272727, CER: 0.101695, loss: 23.253134

  • wav: file:///home/ubuntu/DataSet/EnglishB743954-F-19_127.wav
  • src: “i beat dick in billiards because he gave me tremendous odds”
  • res: “i beat taken billiards because he gaves me tremendous odds”

WER: 0.272727, CER: 0.107143, loss: 23.244072

  • wav: file:///home/ubuntu/DataSet/EnglishB745134-M-18_041.wav
  • src: “once fuel cells have been perfected we could all own one”
  • res: “once fuel says have been perfected be could all on one”

Worst WER:

WER: 2.200000, CER: 2.050000, loss: 158.543167

  • wav: file:///home/ubuntu/DataSet/EnglishB742929-F-19_176.wav
  • src: “he’s a child like me”
  • res: “heis the child like me and hes sick with fear at loss i her”

WER: 2.250000, CER: 2.714286, loss: 191.569550

  • wav: file:///home/ubuntu/DataSet/EnglishB742930-F-19_002.wav
  • src: "oh by the way "
  • res: “hope by the way jeff did you go to grees on purpose”

WER: 2.250000, CER: 1.565217, loss: 186.429092

  • wav: file:///home/ubuntu/DataSet/EnglishB741843-F-22_097.wav
  • src: "a stranger for example "
  • res: “stranger for example mike cup in front of his in frafic”

WER: 2.250000, CER: 1.652174, loss: 157.486099

  • wav: file:///home/ubuntu/DataSet/EnglishB742890-M-22_164.wav
  • src: "his object was however "
  • res: “he’s object was however to be wetorious and not to win money”

WER: 2.285714, CER: 1.638889, loss: 321.967621

  • wav: file:///home/ubuntu/DataSet/EnglishB743011-F-24_363.wav
  • src: “remind me to do laundry every sunday”
  • res: "remind me to do loun nal ha as regon xlesoe a a e or e e e a a a "

What scorer?
I tried with a scorer that I built myself.
Attaching PDF?
Unfortunately, the site does not let me attach or upload a txt file, so instead of pasting a long text I uploaded it as a PDF.
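The WER values above 1.0 in the "Worst WER" list are not a reporting error. Assuming the usual definition (word-level edit distance divided by the number of reference words), a result with many extra words can score above 1. The sketch below is a plain Levenshtein implementation for illustration, not DeepSpeech's own code; `edit_distance`, `wer` and `cer` are names chosen here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(src, res):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = src.split()
    return edit_distance(ref_words, res.split()) / len(ref_words)


def cer(src, res):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(src, res) / len(src)


# First "Worst WER" sample from the test report above.
src = "he’s a child like me"
res = "heis the child like me and hes sick with fear at loss i her"
print(wer(src, res))  # 2.2
```

For that sample the reference has 5 words and the result has 14: 2 substitutions plus 9 insertions give 11 edits, and 11 / 5 = 2.2, which matches the reported value. The extra words after the expected sentence suggest that those clips contain more speech than their transcripts.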

Ah, you are transferring from another model, you could have said that earlier. Your learning rate is too high. Search here for transfer learning. And try training from scratch without another model, you might get better results.

From your testing output, it seems your file just falls into the set of “worst WER”. It seems to have characteristics that make it hard to infer. I have no idea why and I have no time to investigate.