Hi,
I am using the DeepSpeech 0.6.0 pre-built model.
I am trying to get the output from both the microphone (DeepSpeech examples) and a recorded audio file with the same input. But there is a big difference in the output: the recorded audio gives me a WER of 22%, while the output from the microphone gives me a WER of more than 50%.
Why is there such a big difference between the outputs when the hyperparameters are the same?
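For reference, the WER numbers above are the word-level edit distance between the model output and the reference text, divided by the number of reference words. A minimal sketch of how I compute it (my own illustration, not DeepSpeech's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```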
That's because the DeepSpeech model is trained on clean audio, not in noisy environments. Your microphone samples may contain somewhat noisy speech, hence the difference in WERs.
Code used for the recorded audio: deepspeech --model /home/piyush/deepspeech-0.6.0-models/output_graph.pb --lm /home/piyush/deepspeech-0.6.0-models/lm.binary --trie /home/piyush/deepspeech-0.6.0-models/trie --audio /home/piyush/Downloads/a_117.wav
And its transcription is:
"i related a database had used by almost all organization for various task from me made managing and racking her hidebound of information to organizing and processing tendencies it is one of the first concept bear thought in calling school"
Transcription of the audio from the microphone:
“let these had used by almost all over the nitrogen for various talk from managing and taking er he was amount of information to war an izing and possessing pandects”
While the original text is:
" Relational databases are used by almost all organizations for various tasks – from managing and tracking a huge amount of information to organizing and processing transactions. It’s one of the first concepts we are taught in coding school."
Now, as we can see, there is a big difference between the results of the two.
Will try this too.
But I don't think there should be such a big difference between the two.
What’s your opinion on this, and why does it happen?
The way these VAD libraries work is they chop up the audio, and the way they do it seems to give DeepSpeech some trouble. My theory is that the VAD is making a chop a tad too late or too early, so DeepSpeech is receiving either half a word at the beginning or half a word at the end. Whereas if you’re not streaming, there’s no chopping of the audio happening at all, so the results are better.
I’ve been tweaking the VAD code I use in NodeJS, adding a bit of a buffer at the start and end of the audio data that gets fed into DeepSpeech, and the results do seem to improve. I imagine the same technique could be applied in the Python code.
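For what it's worth, here is a minimal sketch of that idea in Python, assuming 16 kHz, 16-bit mono PCM segments coming out of the VAD; the `pad_segment` helper, the 150 ms value, and the loop in the comment are my own assumptions, not the actual example code. Padding with silence is a simplification; keeping a few of the surrounding audio frames instead would be closer to what I do in NodeJS.

```python
SAMPLE_RATE = 16000   # Hz, as expected by the DeepSpeech 0.6 English model
BYTES_PER_SAMPLE = 2  # 16-bit PCM

def pad_segment(segment: bytes, pad_ms: int = 150) -> bytes:
    """Prepend and append pad_ms of silence to a raw PCM segment, so that words
    clipped at the VAD boundaries are less likely to be cut in half."""
    pad = b"\x00" * (SAMPLE_RATE * BYTES_PER_SAMPLE * pad_ms // 1000)
    return pad + segment + pad

# Inside the VAD loop, pad each segment before feeding it to the model, e.g.:
# audio = np.frombuffer(pad_segment(segment), dtype=np.int16)
# text = ds.stt(audio)
```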
The audio from the microphone has a lot of plosives while you're pronouncing (the wind blows/noise in the audio). Try avoiding that; it might get better.
Also, try getting inference on both these files directly, without VAD.
VAD creates problems at the ends of words when it chops, as @dsteinman stated.
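Something like this should run a whole file through the model in one go; a minimal sketch assuming the DeepSpeech 0.6.0 Python package, the paths from the command above, and the default decoder parameters from the released client (beam width 500, lm_alpha 0.75, lm_beta 1.85):

```python
import wave
import numpy as np
from deepspeech import Model

ds = Model("/home/piyush/deepspeech-0.6.0-models/output_graph.pb", 500)
ds.enableDecoderWithLM("/home/piyush/deepspeech-0.6.0-models/lm.binary",
                       "/home/piyush/deepspeech-0.6.0-models/trie",
                       0.75, 1.85)

# Read the full 16 kHz, 16-bit mono WAV and decode it as one utterance:
with wave.open("/home/piyush/Downloads/a_117.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(ds.stt(audio))  # no VAD, no chunking
```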
Hello, nikhilshirwandkar! Could you please say what data was used to train the DeepSpeech 0.6.1 release? How can I train a model, and which datasets should be used to make it more robust, especially when the voice quality is close to phone calls?
I have tried transfer learning on phone-call audio, though the data was small, around 10 hours, and was actually recorded on our side. But I could see the model learning from that data. I used the transferlearning2 branch. Do check it out!