We have around 2300-2400 hours of Hinglish data in total (roughly 80-85% Hindi and 15-20% English).
Our training audio is split into chunks of 1.5 to 10 seconds.
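A quick way to confirm that every training chunk really falls in the 1.5-10 s range is to read each WAV header. The sketch below is a minimal example. It assumes the DeepSpeech-style CSV columns (wav_filename, wav_filesize, transcript) and uncompressed PCM WAV files. The names wav_duration_seconds and out_of_range_chunks are hypothetical helpers, not part of DeepSpeech.

```python
import csv
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / float(wav.getframerate())


def out_of_range_chunks(csv_path, min_s=1.5, max_s=10.0):
    """List (wav_filename, duration) for rows whose audio is outside [min_s, max_s].

    Assumes a header of wav_filename,wav_filesize,transcript (hypothetical helper).
    """
    bad = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            duration = wav_duration_seconds(row["wav_filename"])
            if not (min_s <= duration <= max_s):
                bad.append((row["wav_filename"], duration))
    return bad
```

Calling out_of_range_chunks("data/train.csv") returns an empty list when all chunks are within the stated range.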
DeepSpeech training command:
./DeepSpeech.py \
  --train_files data/train.csv --dev_files data/dev.csv --test_files data/test.csv \
  --alphabet_config_path data/alphabet.txt \
  --checkpoint_dir ~/checkpoint_dir --export_dir ~/export_dir \
  --lm_binary_path data/lm.binary --lm_trie_path data/trie \
  --epochs 20 --report_count 10 --show_progressbar true \
  --train_batch_size 48 --dev_batch_size 48 --test_batch_size 48 \
  --learning_rate 0.0001 --dropout_rate 0.15 --n_hidden 2048 \
  --audio_sample_rate 8000
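The stated data size, chunk lengths and --train_batch_size 48 bound how many optimizer steps one epoch takes. As a rough worked example, the sketch below treats all 2300-2400 hours as training data (ignoring the dev/test split and any samples skipped during training) and assumes every chunk sits at one extreme of the 1.5-10 s range. steps_per_epoch is a hypothetical helper.

```python
import math


def steps_per_epoch(total_hours, chunk_seconds, batch_size):
    """Steps per epoch if every chunk lasted exactly chunk_seconds (one step per batch)."""
    n_chunks = total_hours * 3600 / chunk_seconds
    return math.ceil(n_chunks / batch_size)


# Longest chunks on the smaller dataset give the fewest steps; shortest chunks
# on the larger dataset give the most.
fewest = steps_per_epoch(2300, 10.0, 48)
most = steps_per_epoch(2400, 1.5, 48)
print(fewest, most)  # 17250 120000
```

So 20 epochs correspond to somewhere between about 345,000 and 2,400,000 steps; the real figure depends on the actual average chunk length.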
We run inference both on the whole audio call and on chunks of the same call (split using VAD).
Problems we face:
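The VAD implementation used for splitting is not specified here. To make the chunking step concrete, the following is a minimal sketch of a simple energy-threshold VAD. It is a swapped-in illustration, not the VAD from this pipeline, and the frame size, threshold and silence length are assumed values. A pure energy threshold marks any sufficiently loud noise as speech, and the chunk edges move with energy_threshold and min_silence_frames.

```python
def split_on_energy_vad(samples, sample_rate=8000, frame_ms=30,
                        energy_threshold=500.0, min_silence_frames=10):
    """Split 16-bit PCM samples into speech segments with a simple energy VAD.

    A frame counts as speech when its mean absolute amplitude exceeds
    energy_threshold; a segment is closed after min_silence_frames
    consecutive non-speech frames. Returns (start_sample, end_sample) pairs.
    A trailing partial frame shorter than frame_ms is ignored.
    """
    frame_len = sample_rate * frame_ms // 1000
    segments = []
    start = None          # first sample of the segment being built
    last_speech_end = 0   # end sample of the most recent speech frame
    silence_run = 0       # consecutive non-speech frames since the last speech frame
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        is_speech = sum(abs(s) for s in frame) / frame_len > energy_threshold
        if is_speech:
            if start is None:
                start = i
            last_speech_end = i + frame_len
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                segments.append((start, last_speech_end))
                start = None
    if start is not None:  # audio ended while still inside a segment
        segments.append((start, last_speech_end))
    return segments
```

Raising min_silence_frames makes short pauses stop splitting the audio, so fewer and longer chunks come out; lowering it does the opposite.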
- VAD does not seem to remove noise properly.
- When we run inference on the VAD chunks, extra words such as "ok", "haan" and "sir" are added at the beginning and end of each chunk's predicted output.
- Some words are missing between the chunks.
- When we run inference on the whole audio call, words are also missing, but the extra words are not appended in this case.
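To measure the boundary-word problem instead of spotting it by eye, one option is to count filler words that appear at a chunk's first or last predicted position but not at the same position in the reference. The sketch below assumes per-chunk (reference, prediction) pairs are available; inserted_edge_fillers and FILLERS are hypothetical names, and the filler set only contains the words mentioned above.

```python
from collections import Counter

FILLERS = {"ok", "haan", "sir"}  # words reported above; extend as needed


def inserted_edge_fillers(pairs, fillers=FILLERS):
    """Count filler words at the start/end of chunk predictions that the reference lacks there.

    pairs: iterable of (reference_transcript, predicted_transcript), one per chunk.
    """
    counts = Counter()
    for reference, prediction in pairs:
        ref = reference.lower().split()
        hyp = prediction.lower().split()
        if not hyp:
            continue
        # First predicted word vs first reference word.
        if hyp[0] in fillers and (not ref or ref[0] != hyp[0]):
            counts[hyp[0]] += 1
        # Last predicted word vs last reference word (skip one-word predictions
        # so the same word is not counted twice).
        if len(hyp) > 1 and hyp[-1] in fillers and (not ref or ref[-1] != hyp[-1]):
            counts[hyp[-1]] += 1
    return counts
```

A filler that is also present at the same edge of the reference is not counted, so legitimate "sir" endings do not inflate the numbers.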
Examples:
1.
orig: kya meri baat ******(name) dutt(lname) ji se ho rahi hai kaise ho aap sir
predicted_with_lm: meri baat jo date ho rahi hai jaise
acoustic_output: temse eri baas jobeste ho rahi hai kaise ape
2.
orig: app open kroge vhi pr sir top pr naam show ho rha hoga and jahan pe naam show ho rha hoga vahan pr teen dots bane hoge theek hai sir sir unka jo number hai wo switch off aa raha hai
predicted_with_lm: aap open karoge to ek baar top pe aapka naam se ho raha hoga theek hai sir india mart ki aapne yahan pe naam se naam ka kar paneer steamer hai na usko raha hai ki inidible
acoustic_output: aap open karaoge oek baar top pe aapka naam se oraha hoga theek hai sir indiamart ki aapme aanja pe aam s oga naam ka ka ee ar opan nme ki sir sir to baje number hai na usseco faraha hai ska sirnai in.
3.
orig: This call may be recorded traininig and quality purpose
predicted_with_lm: iis call we will recorded ten quality purpose
acoustic_output: this call we will recorded ten quanity polpose
4.
orig: and your service will be ended upto **** February 2022 ok and your service will be, and you will get twenty ****** per week and all ******* are lapsable and your service will be updated in ten to fifteen days sir
predicted_with_lm: and your service and will be for a two thousand twenty two ok sir your service will be you will get antimalarial you are the table and your service activation ten to fifteen days
acoustic_output: and your service banded wuill be ****** two thousand tenty two ok r your sirthis til be ou will get tente byi cer de aldyou sar let taple and your services ectivation ten to fifteen days
5.
orig: Third party team will visit for physical verification after taking appointment from yo ok sir
predicted_with_lm: third party team will be it or scale verification after the an content for you ok sir
acoustic_output: thir parti te wil betit or isical verification after dekain aont mont fom you ok sir
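The examples above can be scored with word error rate (WER): word-level Levenshtein distance divided by the number of reference words. The following is a minimal stdlib-only sketch. Lowercasing and whitespace splitting are assumed normalization choices, masked spans such as ****** are treated as ordinary tokens, and the reference must be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit-cost substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,             # deletion of r
                cur[j - 1] + 1,          # insertion of h
                prev[j - 1] + (r != h),  # substitution, or match when r == h
            )
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    return edit_distance(ref, hypothesis.lower().split()) / len(ref)


def cer(reference, hypothesis):
    """Character error rate on lowercased strings (spaces included)."""
    ref = reference.lower()
    return edit_distance(ref, hypothesis.lower()) / len(ref)
```

For example 3, this gives a WER of 5/9 (about 0.56) for predicted_with_lm and 6/9 (about 0.67) for acoustic_output.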
Any help appreciated.
Thanks