Very low accuracy with 0.6.1 model - would you sanity check?

Hi, I’m doing some basic evaluation of DeepSpeech on audio collected from spontaneous conversations between two people “in the wild”.

I’ve tried running some of this audio against the released deepspeech-0.6.1 model, and the accuracy is extremely low - so low that I think I’m doing something very wrong. Most transcripts don’t contain a single word that resembles or sounds like the audio, let alone the correct word itself.

The original audio was encoded with 25-bit precision at a 48 kHz sample rate, with two channels for caller and agent. Using sox, I downsampled to 16 kHz at 16-bit precision and split the audio into two separate mono files.
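
For reference, the conversion looked roughly like this (the filenames are placeholders, and this is a paraphrase of the exact invocation rather than a copy of it):

# remix 1 keeps channel 1 (caller), remix 2 keeps channel 2 (agent)
sox call.wav -r 16000 -b 16 caller.wav remix 1
sox call.wav -r 16000 -b 16 agent.wav remix 2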

As an example, here is a wave file:
https://drive.google.com/open?id=12ioZb9R8TwijOEYyUg246cfOECaf5Lgl

I run it through deepspeech using the following command (which I believe is the normal pattern):

deepspeech --model deepspeech-0.6.1-models/output_graph.pbmm --lm deepspeech-0.6.1-models/lm.binary --trie deepspeech-0.6.1-models/trie --audio trimmed.wav

From this, I see the following transcription:
in a good

By contrast, this is my target transcription:
hi how are you good thank you

This example is actually one of the better ones, in that the word “good” was transcribed correctly. Almost all other examples do worse.

Can someone give this a sanity check to confirm I’ve encoded the audio correctly? Is there anything else I might be doing wrong? Or is there any reason this particular snippet of audio might be impossible for deepspeech-0.6.1 to decode successfully?

If there are no errors and it’s simply challenging due to the background noise or other features, is fine-tuning a viable method to get better performance?

Thanks in advance for any help you can provide!

I listened to the clip and the quality isn’t very high. The voice sounds distorted to me - I don’t know if that was in the original recording or something that went wrong with the conversion.

Fine-tuning may help you get better results, in particular the augmentation features. These haven’t been extensively tested yet, but they distort the training data, which may help the model better handle clips like the sample you posted.

I can’t listen right now, but I can only emphasize this. Sadly, our current model is not yet good with noisy / poor-quality audio, nor with non-American English.

So between the noisy audio, low volume, and pace of speech, there are already a lot of variables that could explain the poor results.

Trivial thing: depending on how you call it, sox might apply dithering by default, which impairs recognition a lot. You should make sure it’s actually disabled.
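
For example, sox has a global -D (--no-dither) flag that turns automatic dithering off; something along these lines (filenames are placeholders) should give you an undithered conversion:

# -D disables sox's automatic dithering when reducing bit depth
sox -D call.wav -r 16000 -b 16 caller.wav remix 1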

Fine-tuning could very much be a viable path, if you have training material available.

The only thing I can add is that setting the sox parameter --compression 0.0 for the audio processing also helps a lot to avoid audio distortion.

Thank you for the replies and suggestions. I repeated the sox conversion with the suggested modifications (no dither, no compression), but the model didn’t do a better job of decoding the audio.

My next steps are to train a model on CommonVoice data with noise data augmentation, and then try fine-tuning to my smaller dataset.
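
Roughly, the fine-tuning step would continue training from the released 0.6.1 checkpoint with something like the command below. Paths and hyperparameters are placeholders, and I still need to check the 0.6.1 training docs for the exact augmentation flag names, so treat this as a sketch rather than a final recipe:

# --n_hidden must stay at 2048 to match the geometry of the released checkpoint
python3 DeepSpeech.py \
  --n_hidden 2048 \
  --checkpoint_dir deepspeech-0.6.1-checkpoint/ \
  --train_files my_train.csv \
  --dev_files my_dev.csv \
  --test_files my_test.csv \
  --epochs 3 \
  --learning_rate 0.0001 \
  --export_dir fine-tuned-model/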

How bad is your original audio file? If your source is as bad as @dabinat described, then no conversion is going to improve it, and fine-tuning to your audio quality is going to be the best path, though it requires some amount of data.

I don’t know how to quantify how bad it is. =P My first post in this thread has a link to one utterance’s wave file.

Trusting @dabinat’s feedback on this: when you listen to the original audio sample fresh from your capture and to the audio you shared, is there any difference in quality?