Poor recognition

I hope this is a good place to ask. We want to use DeepSpeech to ‘translate’ voicemail messages to text. I have installed DeepSpeech 0.9.3 with the matching pbmm model and scorer, and it seems to ‘work’, but the output is inaccurate to the point of being unusable, even though my sample sounds very clear to my ear. Our voicemail system exports 8000 Hz WAV files. I get a warning that the audio needs to be 16000 Hz, but it processes anyway (inaccurately). I used ffmpeg to convert the file to 16000 Hz and got different DeepSpeech output, but it was no more accurate. Is this a sample rate issue? Assuming I can’t do anything about the voicemail output, is there anything else I can do?

Converting to 16 kHz is a good idea, since that is the sample rate the model expects.
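
If it helps, here is a rough sketch of how the conversion could be scripted for a batch of voicemails. It only uses Python's standard `wave` and `subprocess` modules and shells out to ffmpeg with its standard `-ar`, `-ac` and `-c:a pcm_s16le` options. The function names and file names are made up for illustration, and mono 16-bit PCM is my assumption about what the model wants as input. Keep in mind that upsampling only changes the format: an 8000 Hz recording contains nothing above 4 kHz, so this alone may not improve accuracy.

```python
import subprocess
import wave

TARGET_RATE = 16000  # sample rate named in the DeepSpeech warning


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def ffmpeg_resample_command(src, dst, rate=TARGET_RATE):
    """Build an ffmpeg command that writes mono 16-bit PCM at `rate` Hz.

    Mono 16-bit output is an assumption about the model's input format.
    """
    return [
        "ffmpeg", "-y",        # overwrite dst if it already exists
        "-i", src,
        "-ar", str(rate),      # output sample rate
        "-ac", "1",            # one audio channel
        "-c:a", "pcm_s16le",   # signed 16-bit little-endian PCM
        dst,
    ]


def ensure_16k(src, dst):
    """Return a path to a 16 kHz file, converting with ffmpeg only if needed."""
    if wav_sample_rate(src) == TARGET_RATE:
        return src
    subprocess.run(ffmpeg_resample_command(src, dst), check=True)
    return dst
```

For example, `ensure_16k("voicemail.wav", "voicemail_16k.wav")` would return the converted path, which you then pass to DeepSpeech as before.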

As for the accuracy of the model, I can suggest two things.

  1. Use version 1.3.0 of Coqui STT with a newer model (in TFLite format).

  2. Use a custom scorer built from your own data. The one you are using is a generic scorer, so it has seen few sentences like the ones in your voicemails.
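
For the custom scorer, the starting point is a plain text file with one sentence per line, taken from transcripts of your own voicemails. Below is a minimal sketch of that preparation step only. The helper names and the output file name are hypothetical, and I am assuming the English model's alphabet is limited to lowercase a-z, the apostrophe and the space, so check it against the alphabet file shipped with your model. Building the actual scorer from this file is done with the language model tools described in the project's documentation.

```python
import re

# Assumption: the model alphabet is a-z, apostrophe and space only.
_NOT_ALLOWED = re.compile(r"[^a-z' ]+")


def normalize_sentence(text):
    """Lowercase a transcript and strip characters outside the assumed alphabet."""
    text = text.lower().replace("\u2019", "'")  # curly apostrophe to plain
    text = _NOT_ALLOWED.sub(" ", text)          # drops punctuation and digits
    return " ".join(text.split())               # collapse repeated whitespace


def build_corpus(transcripts):
    """Return normalized sentences, skipping any that end up empty.

    Duplicates are kept on purpose, since phrase frequency is part of
    what a language model learns from the corpus.
    """
    normalized = (normalize_sentence(t) for t in transcripts)
    return [s for s in normalized if s]


def write_corpus(transcripts, path):
    """Write one normalized sentence per line to `path`."""
    with open(path, "w", encoding="utf-8") as out:
        for sentence in build_corpus(transcripts):
            out.write(sentence + "\n")
```

Note that digits are simply dropped here, so numbers such as phone extensions would need to be spelled out in the transcripts beforehand if you want them recognized.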