I am using retrained DeepSpeech 0.3 to do some inference on recorded calls. However, it always outputs results like
"we for we look on a spirsofalaea a i go to a wouldtilemottofolespurtthou oh we o o n to compare now or bocortoby i for both fought no one to go and be a tale to mocometothecousonanofalinbuta a "
I tried adjusting the model arguments LM_WEIGHT and VALID_WORD_COUNT_WEIGHT, changing the audio sampling rate and format, and chunking the audio into shorter segments. None of these helped.
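For reference, this is roughly how I run the inference, along the lines of the v0.3 example client (the file names are placeholders and the constant values are just the ones I have been varying):

```python
import wave
import numpy as np
from deepspeech import Model

# Constants along the lines of the v0.3 example client; I have been varying the two weights.
BEAM_WIDTH = 500
LM_WEIGHT = 1.50
VALID_WORD_COUNT_WEIGHT = 2.10
N_FEATURES = 26
N_CONTEXT = 9

# Placeholder paths for the released 0.3 model, alphabet, language model and trie.
ds = Model('output_graph.pb', N_FEATURES, N_CONTEXT, 'alphabet.txt', BEAM_WIDTH)
ds.enableDecoderWithLM('alphabet.txt', 'lm.binary', 'trie',
                       LM_WEIGHT, VALID_WORD_COUNT_WEIGHT)

# Read the recorded call as 16-bit PCM samples.
fin = wave.open('call.wav', 'rb')
fs = fin.getframerate()
audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
fin.close()

print(ds.stt(audio, fs))
```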
Any advice on this? Thanks,
lissyx
Can you give us some context? How did you retrain it, what is the source of the data, and what version of the inference code are you using?
Hi lissyx, thanks for the help. I haven’t retrained the model, as there is no labelled data available. The source of the data is recorded calls from a call centre. I am directly using the Python code to do the inference.
The accent is native, but there are places where the speaker tries to correct himself, and there is a little background noise. Format: I tried 8-bit, 16-bit and 32-bit. Sampling rate: I tried 8000 Hz, 16000 Hz and 32000 Hz. I tried the Google API on the same audio and it worked very well: it dropped the self-corrections and kept only the words that made sense.
lissyx
Can you give the source format? Conversions can add artifacts that mess up the data.
Hi lissyx, thanks. I didn’t find the version in the output. But as you said, the sampling rate matters here: the way I changed the sampling rate was not right. Now I use pydub to change the sampling rate, which gives more reasonable output.
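For anyone hitting the same thing, this is roughly what the pydub conversion looks like (file names are placeholders): I convert to 16 kHz, 16-bit, mono WAV.

```python
from pydub import AudioSegment

# Placeholder file names; convert the call recording to 16 kHz, 16-bit, mono WAV.
sound = AudioSegment.from_file('call.wav')
sound = sound.set_frame_rate(16000).set_channels(1).set_sample_width(2)
sound.export('call_16k.wav', format='wav')
```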
How did you change the sampling rate? I’m experiencing a similar issue while using ffmpeg to resample. From what I can tell, pydub uses ffmpeg to do its resampling.
Can you share any more details about what the wrong way and the right way to resample the audio were? And how big a difference did it make? (Did it eliminate the issue completely?)
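For comparison, this is roughly how I have been resampling with ffmpeg (called from Python here just to show the flags; file names are placeholders):

```python
import subprocess

# Resample to 16 kHz, mono, 16-bit signed PCM with ffmpeg; file names are placeholders.
subprocess.run([
    'ffmpeg', '-i', 'call.wav',
    '-ac', '1',              # mono
    '-ar', '16000',          # 16 kHz sample rate
    '-acodec', 'pcm_s16le',  # 16-bit signed PCM
    'call_16k.wav',
], check=True)
```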
Yes. The first method was to use audio software (Audacity) to adjust the sampling rate; later I installed pydub with ffmpeg to do it. What I found is that it works better for some long words. It doesn’t completely solve the issue, as there are still things to work on, such as removing the noise.