Hi,
I am using the DeepSpeech 0.6.0 pre-built model.
I am trying to get the output from both the microphone (DeepSpeech examples) and a recorded audio file with the same input. But there is a big difference in the output: the recorded audio gives me a WER of 22%, while the output from the microphone gives me a WER of more than 50%.
Why is there such a big difference between the outputs when the hyperparameters are the same?
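For reference, the WER numbers above are the word-level edit distance between the model output and the reference text, divided by the number of reference words. A minimal sketch of how I compute it (my own illustration, not DeepSpeech's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```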
That's because the DeepSpeech model is trained on clean audio, not in noisy environments. Your microphone samples may contain somewhat noisy speech, hence the difference in WERs.
Code used for the recorded audio: deepspeech --model /home/piyush/deepspeech-0.6.0-models/output_graph.pb --lm /home/piyush/deepspeech-0.6.0-models/lm.binary --trie /home/piyush/deepspeech-0.6.0-models/trie --audio /home/piyush/Downloads/a_117.wav
And its transcription is:
"i related a database had used by almost all organization for various task from me made managing and racking her hidebound of information to organizing and processing tendencies it is one of the first concept bear thought in calling school"
Transcription of the audio from the microphone:
“let these had used by almost all over the nitrogen for various talk from managing and taking er he was amount of information to war an izing and possessing pandects”
While the original text is:
" Relational databases are used by almost all organizations for various tasks – from managing and tracking a huge amount of information to organizing and processing transactions. It’s one of the first concepts we are taught in coding school."
Now, as we can see, there is a big difference between the results of the two.
Will try this too.
But I don't think there should be such a big difference between the two.
What’s your opinion on this, and why does it happen?
The way these VAD libraries work is they chop up the audio, and the way they do it seems to give DeepSpeech some trouble. My theory is that the VAD is making a chop a tad too late or too early, so DeepSpeech is receiving either half a word at the beginning or half a word at the end. Whereas if you’re not streaming, there’s no chopping of the audio happening at all, so the results are better.
I’ve been tweaking the VAD code I use in NodeJS, adding a bit of a buffer at the start and end of the audio data that gets fed into DeepSpeech, and the results do seem to improve. I imagine the same technique could be applied in the Python code.
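For what it's worth, here is a minimal sketch of that idea in Python, assuming 16 kHz, 16-bit mono PCM segments coming out of the VAD; the `pad_segment` helper, the 150 ms value, and the loop in the comment are my own assumptions, not the actual example code. Padding with silence is a simplification; keeping a few of the surrounding audio frames instead would be closer to what I do in NodeJS.

```python
SAMPLE_RATE = 16000   # Hz, as expected by the DeepSpeech 0.6 English model
BYTES_PER_SAMPLE = 2  # 16-bit PCM

def pad_segment(segment: bytes, pad_ms: int = 150) -> bytes:
    """Prepend and append pad_ms of silence to a raw PCM segment, so that words
    clipped at the VAD boundaries are less likely to be cut in half."""
    pad = b"\x00" * (SAMPLE_RATE * BYTES_PER_SAMPLE * pad_ms // 1000)
    return pad + segment + pad

# Inside the VAD loop, pad each segment before feeding it to the model, e.g.:
# audio = np.frombuffer(pad_segment(segment), dtype=np.int16)
# text = ds.stt(audio)
```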
The audio from the microphone has a lot of plosives while you're pronouncing (the wind blows/noise in the audio). Try avoiding that; it might get better.
Also, try getting inference on both these files directly, without VAD.
VAD creates problems at the ends of words when it chops, as @dsteinman stated.
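Something like this should run a whole file through the model in one go; a minimal sketch assuming the DeepSpeech 0.6.0 Python package, the paths from the command above, and the default decoder parameters from the released client (beam width 500, lm_alpha 0.75, lm_beta 1.85):

```python
import wave
import numpy as np
from deepspeech import Model

ds = Model("/home/piyush/deepspeech-0.6.0-models/output_graph.pb", 500)
ds.enableDecoderWithLM("/home/piyush/deepspeech-0.6.0-models/lm.binary",
                       "/home/piyush/deepspeech-0.6.0-models/trie",
                       0.75, 1.85)

# Read the full 16 kHz, 16-bit mono WAV and decode it as one utterance:
with wave.open("/home/piyush/Downloads/a_117.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(ds.stt(audio))  # no VAD, no chunking
```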
Hello, nikhilshirwandkar! Could you please say what data was used to train the DeepSpeech 0.6.1 release? How can I train a model, and which datasets should be used to make it more robust, especially when the voice quality is close to phone calls?
I have tried transfer learning on phone-call audio, though the data was small, around 10 hours, and was actually recorded on our side. But I could see the model learning from that data. I used the transferlearning2 branch. Do check it out!