As of now, is Deep Speech viable for real-world applications?

Is anyone using Deep Speech in an application and what results are you getting, compared to, say, Google voice recognition on regular audio captured from a smartphone?

There is a blog post proclaiming a 5.6% word error rate, which sounds great, but digging deeper it appears the result was biased by test data leaking into the training set. Additionally, the latest model (0.3) has a higher WER (11.2% on LibriSpeech clean) due to optimizations (see Deepspeech accuracy decreasing?).

Also, I’ve not been able to find any video demos. So far the only thing close to a demo I have come across is this video:

With a bit of tuning, DeepSpeech can be quite good. We built a frontend with an API so we could use DeepSpeech to transcribe voicemails and recordings for our customers. We also contributed code and documentation to FusionPBX so this would be really simple for other people to use, as we want to encourage use of DeepSpeech.

One big gotcha is VAD: it seems to slice audio files too finely, resulting in plenty of misspelled words and bad results, whereas feeding the same audio file directly to DeepSpeech returns a fairly good transcript (though you may use up all your RAM when transcribing 1+ minute files).

Which is rather curious, as the models are trained on <10 sec audio clips.

Have you done any investigation into where the errors on short segments come from? For example:

  • acoustic vs language model
  • can you see the same errors when segmenting the audio manually rather than using webrtcVAD?

By acoustic vs language model do you mean trying transcription without the language model? If so, we did that accidentally in older versions of DeepSpeech Frontend; the results were poor. This was prior to VAD, though.
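For anyone wanting to reproduce that with/without-LM comparison, here is a rough sketch using the Python bindings. It loosely follows the 0.3-era client script; the constructor/decoder arguments, the hyperparameter values and the file paths below are assumptions that differ between releases, so treat it as an outline rather than the actual Frontend code:

```python
import wave

import numpy as np
from deepspeech import Model  # pip install deepspeech (0.3-era bindings assumed)

# Placeholder paths to the released 0.3 artifacts -- adjust to your setup.
MODEL, ALPHABET = "models/output_graph.pb", "models/alphabet.txt"
LM, TRIE = "models/lm.binary", "models/trie"

# Values taken from the 0.x client scripts; the defaults differ between releases.
N_FEATURES, N_CONTEXT, BEAM_WIDTH = 26, 9, 500
LM_ALPHA, LM_BETA = 0.75, 1.85

with wave.open("utterance.wav", "rb") as w:  # expects 16 kHz, 16-bit mono PCM
    fs = w.getframerate()
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)

ds = Model(MODEL, N_FEATURES, N_CONTEXT, ALPHABET, BEAM_WIDTH)
print("acoustic model only:", ds.stt(audio, fs))   # plain CTC beam search, no LM

ds.enableDecoderWithLM(ALPHABET, LM, TRIE, LM_ALPHA, LM_BETA)
print("with language model:", ds.stt(audio, fs))   # LM-rescored decoding
```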

Manually breaking apart audio doesn’t create these incorrect transcriptions. My working theory (corroborated by others on the #machinelearning IRC channel) is that WebRTC VAD is a touch aggressive in slicing up audio, so the beginning and end of a word can sometimes get partially chopped off, resulting in DeepSpeech not understanding the malformed audio it’s given.
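To make that theory concrete, something along these lines keeps a few frames of padding around each region WebRTC VAD marks as speech, so word onsets and offsets are less likely to be chopped. This is just a sketch, not the collector we actually run, and the aggressiveness and padding values are guesses to tune:

```python
import webrtcvad

SAMPLE_RATE = 16000                               # webrtcvad accepts 8/16/32/48 kHz, 16-bit mono
FRAME_MS = 30                                     # valid frame lengths: 10, 20 or 30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # bytes per frame (2 bytes per sample)

def speech_regions(pcm, aggressiveness=1, pad_frames=5):
    """Yield (start_byte, end_byte) for each speech region, widened by
    pad_frames frames on each side so word edges are less likely to be cut.
    Adjacent padded regions may overlap; merge them if that matters to you."""
    vad = webrtcvad.Vad(aggressiveness)           # 0 = least aggressive, 3 = most
    n_frames = len(pcm) // FRAME_BYTES
    flags = [vad.is_speech(pcm[i * FRAME_BYTES:(i + 1) * FRAME_BYTES], SAMPLE_RATE)
             for i in range(n_frames)]
    start = None
    for i, voiced in enumerate(flags + [False]):  # sentinel closes a trailing region
        if voiced and start is None:
            start = i
        elif not voiced and start is not None:
            lo = max(0, start - pad_frames) * FRAME_BYTES
            hi = min(n_frames, i + pad_frames) * FRAME_BYTES
            yield lo, hi
            start = None
```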

Yeah, one way to check the results of the acoustic model in greater detail is to look at the output of the softmax layer and see which characters were guessed as most probable,

e.g. is the expected character among the 5 most probable characters guessed by the model?

If all expected characters are there after the acoustic phase, then the language model part is the culprit.
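As a rough sketch of that check, assuming you already have the per-frame softmax matrix out of the acoustic model (e.g. via the script linked further down in this thread), and substituting your model’s actual alphabet.txt ordering for the placeholder below:

```python
import numpy as np

# Replace with the characters from your model's alphabet.txt, in order;
# any label index beyond the alphabet is treated here as the CTC blank.
ALPHABET = list("abcdefghijklmnopqrstuvwxyz' ")  # placeholder ordering

def top_k_chars(softmax, alphabet=ALPHABET, k=5):
    """softmax: (n_frames, n_labels) array from the acoustic model's output.
    Returns, for each frame, the k most probable characters."""
    order = np.argsort(softmax, axis=1)[:, ::-1][:, :k]
    return [[alphabet[j] if j < len(alphabet) else "<blank>" for j in row]
            for row in order]

def expected_chars_present(softmax, expected, alphabet=ALPHABET, k=5):
    """Rough check: does every character of the expected transcript appear in
    some frame's top-k? If yes, suspect the language model/decoder instead."""
    seen = {c for row in top_k_chars(softmax, alphabet, k) for c in row}
    return all(c in seen for c in expected)
```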

Even with VAD cutting off the first/last word, the middle part should still be reasonably transcribed, unless the bad start/end of the word sequence throws the language model off completely.

Any pointers as to how to look at the output of the softmax layer? I haven’t had much time to dig into this; I’ve been sidetracked doing some work on optimizing the transcoding of WAVs to Opus.

I’m using a script inspired by the inference part of the deepspeech Python script for that. It’s also described here:

https://discourse.mozilla.org/t/can-i-use-pre-trained-model-with-deepspeech-py/23577/8

Working on a TF Lite version integrated into another app through the mozillaspeechlibrary code, I can confirm one needs to be careful around WebRTC VAD, but I could get nice results on a Pixel 2, even with some background noise and in French.

Good stuff, thanks for the insights!

Shouldn’t adding a buffer time before and after the webrtcvad output solve the problem?

For example, if VAD says the voice lies between 4.20 s (start) and 6.80 s (end), we can cut the chunk from 4.18 s to 6.82 s, i.e. with a 20 ms buffer before the start time and after the end time.
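In code, the padding itself would just be something like this minimal sketch (16 kHz and the 20 ms default are only the numbers from the example above):

```python
import numpy as np

SAMPLE_RATE = 16000  # matches the rate the models expect

def cut_with_buffer(audio, start_s, end_s, buffer_s=0.02):
    """Slice a mono int16 numpy array, widening the VAD region by buffer_s
    seconds on each side, e.g. 4.20-6.80 s becomes 4.18-6.82 s for 20 ms."""
    lo = max(0, int(round((start_s - buffer_s) * SAMPLE_RATE)))
    hi = min(len(audio), int(round((end_s + buffer_s) * SAMPLE_RATE)))
    return audio[lo:hi]
```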

The only problem here would be to choose the exact buffer time to use.

Am I correct in following this approach to deal with this error?
Thanks in advance.

Yes, that is the approach I’ve used in the past to avoid cutting into the beginning of the utterance. It worked fairly well for my use case.

I have two questions.
A. What buffer time did you use?
B. What can be done about the extra predictions (from the language model) described below?

In my case, I initially tried buffer times from 0.3 down to 0.1 sec. This resulted in extra junk predictions at the beginning and end.
Eg:
src: “hello good morning this is amit”
res: “ok hello good morning this is amit hi”

This happened even though there wasn’t any speech in the extra buffer time.
I realise that these extra predictions are the most frequent words in my dictionary.

This also happens when VAD outputs chunks containing absolutely no speech, only noise.
Even in these situations, the predictions come out as the most frequent words, like "ok"s and "hi"s.

How to deal with these two situations?