My DeepSpeech doesn't recognise as much as I expected?

Hi all,

I am brand new to DeepSpeech and tried out the Mic VAD Streaming example. It is pretty much exactly what I want to do (convert a stream of audio from a microphone), and I got the example working using this scorer and model:

curl -LO
curl -LO

As per the getting started guide:

It does recognise simple English statements, but it fails on short words such as:

“Oh”, “Hmm”, “Do”, “Ok”, “Hi”. As far as I can tell it also does not understand any swear words, such as f**k, or words outside very basic English. It also seems not to pick up the first spoken word of a sentence very accurately.

I also don’t see it understand the alphabet: for example, for “A”, “B”, “C” I get nothing recognised. How can we make it recognise these?

Is there an alternative model or scorer that may be more deeply trained? Any advice on improving its ability to pick up “filler” words in a spoken sentence, such as “ugh” or “hffffff” (a huffing sound)? There may well be things within this example that I can tweak to improve recognition, and I would love to hear any suggestions!
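For context, the core of the example as I understand it boils down to something like the sketch below (assuming the `deepspeech` 0.9 Python package and NumPy are installed; the PCM helper is plain standard library):

```python
import array

def pcm16_to_samples(buf):
    # DeepSpeech expects 16 kHz, mono, 16-bit little-endian PCM;
    # this converts raw microphone bytes into signed 16-bit samples.
    samples = array.array("h")
    samples.frombytes(buf)
    return samples

def transcribe_chunks(model_path, scorer_path, chunks):
    # Sketch of the streaming API used by the Mic VAD example.
    # Imported lazily so the helper above works without the package installed.
    import deepspeech
    import numpy as np

    model = deepspeech.Model(model_path)
    model.enableExternalScorer(scorer_path)

    stream = model.createStream()
    for chunk in chunks:  # each chunk: raw 16-bit PCM bytes from the mic/VAD
        stream.feedAudioContent(np.frombuffer(chunk, dtype=np.int16))
    return stream.finishStream()  # final transcript as a string
```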

Many thanks - immensely exciting technology!

Most of our datasets skip disfluencies like “hm” and “uh”, so those are not going to be picked up. There’s no blacklist of swear words or anything like that; if they’re not getting picked up, it’s more likely due to either the recording conditions (noisy environment, microphone quality, too much echo) or the accent of the speaker. Most of the training data is American English, and so far it’s mostly clean recording conditions. We’re working on augmenting our training set with noisy, echoey, compressed, etc. audio to make the model more robust to such cases, so future releases may handle this better if that is indeed what’s causing your issue.

If your intended use case has a limited vocabulary, for example if you want to transcribe lots of recordings of people talking about the same subject, then building a custom language model from textual data matching the use case can significantly boost the accuracy of the transcription. If you’re doing general recognition with no specific use case in mind, then you’d need labeled audio data to train or fine-tune a model with.
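For the custom language model route, the workflow looks roughly like this (a sketch based on the external scorer docs; the paths, alpha/beta values, and pruning settings are illustrative, and it assumes a DeepSpeech source checkout with KenLM built):

```shell
# 1. Build an ARPA LM and compact binary from your domain text (one sentence per line).
python3 data/lm/generate_lm.py \
  --input_txt my_domain_text.txt \
  --output_dir . \
  --top_k 500000 \
  --kenlm_bins path/to/kenlm/build/bin \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie

# 2. Package it into a .scorer usable with the release acoustic model.
./generate_scorer_package \
  --alphabet alphabet.txt \
  --lm lm.binary \
  --vocab vocab-500000.txt \
  --package my_domain.scorer \
  --default_alpha 0.93 \
  --default_beta 1.18
```

You then pass `my_domain.scorer` to the example in place of the release scorer.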

Hi Reuben,

I think it would be great to start with the best publicly available model and then augment it for my specific use case.

I had a look at the training docs, but they didn’t seem terribly easy to approach.

Is there a good guide to taking a starting-point model and combining it with my custom additions? I don’t really understand how the training side of DeepSpeech works, but I imagine it involves word-sized audio inputs, each paired with an associated text version of that word?

If that is the case, can you cross-train models with, say, Google’s speech-to-text, which may be able to detect some of these unknown words consistently, to get the associated text more simply?

Many thanks,


Correct, it isn’t terribly easy. The most relevant docs are the fine-tuning section, but you should familiarize yourself with training in general before doing that.
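As a rough sketch, fine-tuning from a release checkpoint looks something like the following (flags and values here are illustrative; `--n_hidden` must match the release model, and each CSV lists `wav_filename,wav_filesize,transcript` rows pointing at 16 kHz mono WAV clips):

```shell
# Assumes a DeepSpeech training checkout and the matching release checkpoint.
python3 DeepSpeech.py \
  --n_hidden 2048 \
  --checkpoint_dir deepspeech-0.9.3-checkpoint \
  --epochs 3 \
  --learning_rate 0.0001 \
  --train_files my-train.csv \
  --dev_files my-dev.csv \
  --test_files my-test.csv
```

A small learning rate and a few epochs keep the fine-tuned model close to the original weights while adapting it to your data.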

There is plenty of info around on how DeepSpeech works, from my blog posts on Hacks, to the Distill guide on CTC, to other guides on the internet about how CTC models work.

This violates Google’s terms for their speech-to-text service.