Hi, I’m doing some basic evaluation of DeepSpeech on audio collected from spontaneous conversations between two people “in the wild”.
I’ve tried running some of this audio against the pre-trained deepspeech-0.6.1 model, and the accuracy is extremely low - so low that I think I must be doing something very wrong. Most transcripts don’t contain a single word that resembles or even sounds like what’s in the audio, much less the correct word itself.
The original audio was encoded with 25-bit precision at a sample rate of 48 kHz, with two channels for caller and agent. Using sox, I’ve downsampled to 16 kHz at 16-bit precision and split the audio into two separate mono files.
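For reference, the sox conversion was along these lines (the file names here are placeholders rather than my actual paths):

sox original.wav -r 16000 -b 16 caller.wav remix 1
sox original.wav -r 16000 -b 16 agent.wav remix 2

Each command resamples to 16 kHz, reduces the bit depth to 16 bits, and keeps a single channel via the remix effect.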
As an example, here is a wave file:
https://drive.google.com/open?id=12ioZb9R8TwijOEYyUg246cfOECaf5Lgl
I run it through deepspeech using the following command (which I believe is the normal pattern):
deepspeech --model deepspeech-0.6.1-models/output_graph.pbmm --lm deepspeech-0.6.1-models/lm.binary --trie deepspeech-0.6.1-models/trie --audio trimmed.wav
From this, I see the following transcription:
in a good
By contrast, this is my target transcription:
hi how are you good thank you
This example is actually one of the better ones, insofar as the word “good” was correctly transcribed. Almost all other examples do worse.
Can someone give this a sanity check to confirm I’ve encoded the audio correctly? Is there anything else that I might be doing wrong? Or any reason this particular snippet of audio might be impossible for deepspeech-0.6.1 to decode successfully?
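In case it helps with the sanity check, the converted file’s format can be inspected with soxi (bundled with sox), which reports the sample rate, precision, and channel count:

soxi trimmed.wav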
If there are no errors and the audio is simply challenging due to background noise or other characteristics, is fine-tuning a viable way to get better performance?
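If fine-tuning is the way to go, my understanding from the training docs is that I would start from the released 0.6.1 checkpoint and run something roughly like this (the CSV paths are placeholders for my own wav_filename/wav_filesize/transcript manifests, and the hyperparameters are just a starting point):

python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir deepspeech-0.6.1-checkpoint/ --train_files my-train.csv --dev_files my-dev.csv --test_files my-test.csv --epochs 3 --learning_rate 0.0001

Please correct me if that’s not the recommended approach for this kind of data.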
Thanks in advance for any help you can provide!