Pretrained model cannot produce accurate English words

I am using the pretrained DeepSpeech 0.3 model to run inference on recorded calls. However, it always outputs results like

"we for we look on a spirsofalaea a i go to a wouldtilemottofolespurtthou oh we o o n to compare now or bocortoby i for both fought no one to go and be a tale to mocometothecousonanofalinbuta a "

I tried adjusting the model arguments LM_WEIGHT and VALID_WORD_COUNT_WEIGHT, changing the audio sampling rate and format, and chunking the audio into shorter pieces. None of them helped.

Any advice on this? Thanks,

Can you give us some context? How did you retrain, what is the source of the data, and what version of the inference code are you using?

Hi lissyx, thanks for the help. I haven't retrained the model, as there is no labelled data available. The data source is recorded calls from a call centre. I am using the Python code directly to do the inference:

from deepspeech import Model
import scipy.io.wavfile as wav

# 26 MFCC features, context window of 9, beam width 500
ds = Model('Test/output_graph.pb', 26, 9, 'Test/alphabet.txt', 500)
# LM_WEIGHT 1.5, VALID_WORD_COUNT_WEIGHT 2.1
ds.enableDecoderWithLM('Test/alphabet.txt', 'Test/lm.binary', 'models/trie', 1.5, 2.1)
fs, audio = wav.read('short_test.wav')
processed_data = ds.stt(audio, fs)
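The chunking I mentioned was along these lines (a rough sketch; the 30-second chunk length is an arbitrary value I picked):

# Split long recordings into fixed-size chunks and decode each one
chunk_len = 30 * fs  # 30 seconds per chunk; the length is arbitrary
for start in range(0, len(audio), chunk_len):
    print(ds.stt(audio[start:start + chunk_len], fs))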

Any details? Like format, sampling rate? Are the people speaking with a native accent?

The whole output would help as well, to check the version of libdeepspeech.so you are using.

The accent is native, but in places the speaker tries to correct himself, and there is a little background noise. Format: I tried 8-bit, 16-bit, and 32-bit. Sampling rate: I tried 8000 Hz, 16000 Hz, and 32000 Hz. I tried the Google API on the same audio and it worked very well; it dropped the self-corrections and kept only the words that made sense.
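To double-check what each converted file actually contains, I look at the WAV header (a minimal sketch using only the standard-library wave module, which reads PCM WAVs but not float ones):

import wave

with wave.open('short_test.wav', 'rb') as w:
    # Channels, bytes per sample, and frame rate as stored in the header
    print(w.getnchannels(), w.getsampwidth(), w.getframerate())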

Can you give the source format? Conversions can add artifacts that mess up the data.

The source is 32000 Hz, 32-bit.

And how many channels?

Oh, I forgot to mention that. It is mono.

Ok. I’m still waiting on the exact version of libdeepspeech.so you are using …

Is it possible you might share some of them?

I cannot share them. I am not sure which version of libdeepspeech.so it is. The DeepSpeech version is 0.3.0.

It should be printed on the output when you run it …

Hi lissyx, thanks. I didn't find the version in the output. But as you said, the sampling rate matters here. The way I was changing the sampling rate was not right. Now I use pydub to change the sampling rate, which gives much more reasonable output.
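Concretely, the pydub conversion is along these lines (a minimal sketch; the file names are placeholders, and pydub needs ffmpeg installed):

from pydub import AudioSegment

# Resample and downmix to 16 kHz, 16-bit, mono before feeding the model
sound = AudioSegment.from_wav('short_test.wav')
sound = sound.set_frame_rate(16000).set_channels(1).set_sample_width(2)
sound.export('short_test_16k.wav', format='wav')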

You should also try the new decoder in master; it should alleviate these problems. There are instructions here: https://github.com/mozilla/DeepSpeech/issues/1156#issuecomment-434351398

Why don't you share the whole output? It should be printed by this call: https://github.com/mozilla/DeepSpeech/blob/master/native_client/deepspeech.cc#L360

How did you change the sampling rate? I’m experiencing a similar issue, using ffmpeg to resample. From what I can tell, pydub uses ffmpeg to do its resampling:

http://github.com/jiaaro/pydub/blob/master/API.markdown#audiosegmentexport

Can you share any more details about what was the wrong way and the right way to resample the audio? And how big a difference did it make? (Did it eliminate the issue completely?)
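For reference, this is roughly the resampling call I'm using (invoked from Python via subprocess here; file names are placeholders):

import subprocess

# Resample to 16 kHz, 16-bit PCM, mono; -y overwrites any existing output
subprocess.run(['ffmpeg', '-y', '-i', 'in.wav',
                '-ar', '16000', '-ac', '1', '-acodec', 'pcm_s16le',
                'out_16k.wav'], check=True)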

Yes. With the first method I used audio software (Audacity) to adjust the sampling rate; later I installed pydub with ffmpeg and used that instead. What I found is that it works better for some long words. It does not completely solve the issue, as there are still things to work on, such as removing the noise.