Different results from microphone and recorded audio

I am using the DeepSpeech 0.6.0 pre-built model.
I am trying to get output from both the microphone (DeepSpeech examples) and recorded audio with the same input. But there is a big difference in the output: the recorded audio gives me a WER of 22%, while the microphone output gives me a WER of more than 50%.

Why is there such a big difference in the output when the hyperparameters are the same?


It's because the DeepSpeech model is trained on clean audio, not on audio from noisy environments. Your microphone samples may contain somewhat noisy speech. That's why the WERs differ.

Thanks for the reply.
But the recorded audio was recorded with the same microphone that I am using for live transcription.

P.S. Both have the same noise level and the same speech, but different WERs.

Can you share a sample of both?

Please use 0.6.1

Please share more details on exactly what you test here.

@lissyx @nikhilshirwandkar
Thanks for quick reply.

Command used for the recorded audio:
deepspeech --model /home/piyush/deepspeech-0.6.0-models/output_graph.pb --lm /home/piyush/deepspeech-0.6.0-models/lm.binary --trie /home/piyush/deepspeech-0.6.0-models/trie --audio /home/piyush/Downloads/a_117.wav

Here is the link to the recorded audio: https://drive.google.com/file/d/1B4hrqIDH5ge7A0hDxM3-Edw-zk5Pm8-c/view?usp=sharing

And its transcription is:
"i related a database had used by almost all organization for various task from me made managing and racking her hidebound of information to organizing and processing tendencies it is one of the first concept bear thought in calling school"

Command for audio from the microphone:

python mic_vad_streaming.py --vad_aggressiveness 3 --savewav /home/piyush/saveaudio --model /home/piyush/deepspeech-0.6.0-models/output_graph.pbmm --lm /home/piyush/deepspeech-0.6.0-models/lm.binary --trie /home/piyush/deepspeech-0.6.0-models/trie

Link to the audio from the microphone (first line only): https://drive.google.com/file/d/1bY1se1s4h86ZVwr2eo_SgizP4D6OvyVW/view?usp=sharing

Transcription of the microphone audio:
“let these had used by almost all over the nitrogen for various talk from managing and taking er he was amount of information to war an izing and possessing pandects”

While the original text is:
"Relational databases are used by almost all organizations for various tasks – from managing and tracking a huge amount of information to organizing and processing transactions. It’s one of the first concepts we are taught in coding school."

As we can see, there is a big difference between the two results.
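
For anyone who wants to reproduce the comparison, here is a minimal sketch of how a word error rate can be computed from a reference text and a transcription. It is not the scoring code used for the numbers in this thread: the function name `wer` and the normalization (lowercasing and splitting on whitespace) are assumptions made for illustration, and punctuation in the reference should be stripped beforehand for a fair comparison.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length.

    Assumption: both strings are normalized only by lowercasing and
    splitting on whitespace; punctuation is NOT removed here.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = cur[j - 1] + 1   # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)
```

Calling `wer(original_text, file_transcription)` and `wer(original_text, mic_transcription)` with the same reference gives two directly comparable numbers. Note that a WER can exceed 1.0 when the hypothesis contains many inserted words.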

Will try this too.

But I think there should not be such a big difference between these two.
What’s your opinion on this, and why does it happen?

The way these VAD libraries work is that they chop up the audio, and the way they do it seems to give DeepSpeech some trouble. My theory is that the VAD makes a cut a tad too late or too early, so DeepSpeech receives either half a word at the beginning or half a word at the end. If you’re not streaming, the audio isn’t chopped at all, so the results are better.

I’ve been tweaking the VAD code I use in NodeJS, adding a bit of a buffer at the start and end of the audio data that gets fed into DeepSpeech, and the results do seem to improve. I imagine the same technique could be applied in the Python code.
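
The padding idea can be sketched in Python roughly as follows. This is only an illustration and not the NodeJS code mentioned above or the logic in mic_vad_streaming.py: the function `padded_segments` and its parameters are hypothetical, and it assumes you already have one boolean per audio frame saying whether the VAD judged that frame as speech.

```python
def padded_segments(voiced, pad_frames):
    """Return [start, end) frame ranges for voiced runs, widened by pad_frames.

    voiced:     list of booleans, one per fixed-size audio frame (assumed to
                come from a VAD such as webrtcvad).
    pad_frames: number of extra frames to keep before and after each run.
    """
    # 1. Find the raw voiced runs.
    runs = []
    start = None
    for i, is_speech in enumerate(voiced):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(voiced)))

    # 2. Widen each run and merge runs that now touch or overlap,
    #    so no frame is sent to the recognizer twice.
    merged = []
    for run_start, run_end in runs:
        run_start = max(0, run_start - pad_frames)
        run_end = min(len(voiced), run_end + pad_frames)
        if merged and run_start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], run_end))
        else:
            merged.append((run_start, run_end))
    return merged
```

Each returned range can then be used to slice the list of frames before feeding them to the recognizer, so a word that begins slightly before or ends slightly after the VAD decision is less likely to be cut in half.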


The microphone audio has a lot of plosives in your pronunciation (the windy blowing noise in the audio). Try to avoid that; it might get better.

Also, try running inference on both of these files directly, without VAD,
because VAD causes problems at the ends of words when it chops the audio, as @dsteinman stated.

@dsteinman @nikhilshirwandkar
Thanks for the responses.

Yes, it was most probably due to the VAD chopping the words.


"The audio with microphone has lot of plosives while you re pronouncing( the windy blows/noise in the audio)"

Yes, but the voice doesn’t contain much noise. Sometimes there is very little noise, and at those moments it should transcribe correctly.

Hello, nikhilshirwandkar! Could you please tell me what data was used to train the DeepSpeech 0.6.1 release? How can I train a model, and which datasets should I use to make it more robust, especially when the voice quality is close to that of phone calls?

I have tried transfer learning on phone call audio that was recorded on our side, though there was not much data, only around 10 hours. Still, I could see the model learning from that data. I used the transferlearning2 branch. Do check it out!

My microphone volume was too high, and it gave me really bad results. Try changing the gain of your microphone.
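
One rough way to check whether gain is the problem is to look at the peak level and the share of clipped samples in a saved recording (for example, a file written by `--savewav`). The sketch below uses only the Python standard library; the function name `wav_level_report` and the clipping threshold are assumptions for illustration, and it handles 16-bit PCM WAV files only.

```python
import struct
import wave


def wav_level_report(path):
    """Report format, peak level, and clipped-sample share of a 16-bit PCM WAV."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        sample_width = wf.getsampwidth()
        rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())

    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM audio")

    # WAV stores samples as little-endian signed 16-bit integers.
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    peak = max((abs(s) for s in samples), default=0)
    # Assumption: a sample at (or next to) full scale counts as clipped.
    clipped = sum(1 for s in samples if abs(s) >= 32767)

    return {
        "rate": rate,
        "channels": channels,
        "peak": peak / 32768.0,  # 1.0 means full scale
        "clipped_fraction": clipped / len(samples) if samples else 0.0,
    }
```

A peak close to 1.0 together with a noticeable clipped fraction suggests the input gain is too high. The report also shows the sample rate and channel count, which is useful because the released models expect 16 kHz mono audio.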
