Hello,
I trained a model on Indian English with almost 650,000 training samples and around 70,000 test samples. The following parameters were used while training the model:
python3 -u DeepSpeech.py --noshow_progressbar \
  --automatic_mixed_precision True \
  --train_files /home/ubuntu/DataSet/Data/TrainIndianEng.csv \
  --train_batch_size 8 \
  --dev_files /home/ubuntu/DataSet/Data/DevIndianEng.csv \
  --dev_batch_size 8 \
  --test_files /home/ubuntu/DataSet/testData/TestIndianEng.csv \
  --test_batch_size 8 \
  --max_to_keep 5 \
  --checkpoint_dir /home/ubuntu/deepspeech/checkDir \
  --learning_rate 0.0001 \
  --epochs 20 \
  --dropout_rate 0.20 \
  --train_cudnn True \
  --use_allow_growth True
The above training completes with a loss of around 13, and the test run ends with a median WER of 0.30.
Although inference works correctly on the test data set, if I try to infer some other Indian English WAV file of around 5-6 minutes of speech, the output is not complete. In fact, it outputs only around 1-3 words.
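For reference, inference is run with the deepspeech client roughly like this (a sketch; the model and audio file names are placeholders, and --scorer would be added if an external scorer were used):

deepspeech --model indian_english.pbmm --audio long_indian_sample.wav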
Following is the output:
TensorFlow: v2.2.0-24-g1c1b2b9
DeepSpeech: v0.8.2-0-g02e4c76
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2020-10-15 06:44:12.618432: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 1.53s.
Running inference.
> wark ter vts cuta how an ee i o a
Inference took 292.311s for 343.620s audio file.
If I run inference on the same WAV file with a US English model built from LibriSpeech data, I get the following.
TensorFlow: v2.2.0-24-g1c1b2b9
DeepSpeech: v0.8.2-0-g02e4c76
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2020-10-15 06:50:32.907259: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 1.49s.
Running inference.
> pr yo bo on migi te lon lettit at cand tack ther hok an i har bereand i din’t od thee bi hu banier opla bort gu il o alin a gol hluoked a bea back to me ty an’to beok dor shoulpangeni kit thim omen hell ah do il ten mymkan abit e of to contact numberr man an do te opi tine it clolitle bye no it nine nine to the opi ocan i nlang do dittle qui b wenined bortho wiel the nine pord hied ton the don de temomen that mymly no man rubing angi and you ae i col a dyu beal hill were yer to tonl ment i me i ia he conon hold pam minute nen dod on the ank you in enoo a a ao a aahaa a aaao oh ao iah i ket ot tink on tigh came up ahd ma tayk hume pie mi’ bare yin ol heu in doing a bo herkin a mong then etdo ti wout gon a ho with o lod you may thinkin than ik you ai then w im a ame be ma po ii aa a aaaah a ioohho iaaaahaah a iaoohal bonblaooaaahha ha io ohapo aaa a ant o m a hahen waa eam behol heha lao lo i u abat a lom at it u ethat omaat ecanilt mun e bha i apon a gat en o ong hanno yet mot ah shebandinie woul not be bushe dege te caldigt not cand and interco bac with tan happinar the youre ho wone me n en olr tang i do on har han dwelver no he lar gone you my you a bhe get back to me at do wete mi won obed them the un thegod a hen that by wart you gould be on the poo them let it irn thad bol in lack bove beco do ot am brobema and he have e it yo logt to gen tack to he has bo wide any at the adban by and men it martete on of get the crited din lacko di her are ight marm oh under ten bud he wh in lordy licatie pold ado know gar you not id levlen you knongoun the bold to vilivin ah maeor hap he provigion turchen the deter inn is la yeu to harbauck the tan ort and our helilad orere for to inte were heu oh den flive o’clod or car ith an him u lo a ther an you orer we my by number i had ut on he mudled be te cod then unber in that the hillinof ant oud ten o to you yeuv be the the onn nor be a bego then the goi lot my you tae don min un lork ogornly go bev er than thof te themer bat my un word od epear oi ye ye el can have you numbero anby b yor night at no do ork to lyny drew teto thau belon dola ve youd tem ete noto leven thick in’tn there notle ola in an the goden amon again emp on nine your d dot no niabe war my la do ne or noboe london heven becleat portect i go do now be o wit yow acan a giv aco batmit before by book loco care he mantei ah my ning it hadh lord let i atelot be i ar eago bang lok my my hammam pank you thank it le hodetn ir oren
Inference took 292.779s for 343.620s audio file.
The output from the LibriSpeech model is not correct either, but at least it produces a lot of text, which is what I expect from 5-6 minutes of audio.
I feel the audio file itself should be fine, as the LibriSpeech example shows. However, I don't understand the reason for the bad output from the Indian English model.
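For completeness, one thing that can be checked is whether the WAV file matches the format the model was trained on (DeepSpeech trains at 16 kHz, 16-bit, mono by default). Assuming sox is installed, the file can be inspected like this (the file name is a placeholder):

soxi long_indian_sample.wav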
Can someone help me debug this further? The Indian English model cannot be completely faulty, since it infers the test data correctly, and the audio file cannot be faulty either, since the US English model was able to transcribe a lot of text from it.
Is there a way to debug this situation?