As of now, is Deep Speech viable for real-world applications?

(Misho Kostira) #1

Is anyone using Deep Speech in an application and what results are you getting, compared to, say, Google voice recognition on regular audio captured from a smartphone?

There is a blog post proclaiming a 5.6% word error rate, which sounds great, but digging deeper it appears the result was biased by test data leaking into the training set. Additionally, the latest model (0.3) has a higher WER (11.2% on LibriSpeech clean) due to optimizations (see Deepspeech accuracy decreasing?).

Also, I’ve not been able to find any video demos. So far the only thing close to a demo I have come across is this video:

(Dan) #2

With a bit of tuning, DeepSpeech can be quite good. We built a frontend with an API so we could use DeepSpeech to transcribe voicemails and recordings for our customers. We also contributed code and documentation to FusionPBX so this would be really simple for other people to use, as we want to encourage use of DeepSpeech.

One big gotcha is VAD: it seems to slice audio files too finely, resulting in numerous misspelled words and poor results, whereas feeding the same audio file directly to DeepSpeech returns a fairly good transcript (though you may exhaust your RAM when transcribing files longer than a minute).
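One common mitigation for over-aggressive slicing is to pad each detected speech segment with a few extra frames on either side before sending it to the recognizer. The sketch below is a toy illustration of that idea only; the function names and the energy threshold are hypothetical, and a real pipeline would run webrtcvad on 10/20/30 ms frames instead of this simple amplitude check.

```python
# Toy illustration of padding VAD segments so word onsets/offsets that an
# aggressive detector clips are kept. Thresholds and names are hypothetical.

def detect_speech_frames(frames, threshold=0.1):
    """Mark a frame as speech if its mean absolute amplitude exceeds the threshold."""
    return [sum(abs(s) for s in f) / len(f) > threshold for f in frames]

def segments_with_padding(flags, pad=2):
    """Merge consecutive speech frames into (start, end) segments, then widen
    each segment by `pad` frames on both sides, clamped to the file bounds."""
    segs, start = [], None
    for i, is_speech in enumerate(flags):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segs.append((start, i))
            start = None
    if start is not None:
        segs.append((start, len(flags)))
    return [(max(0, s - pad), min(len(flags), e + pad)) for s, e in segs]
```

The padding trades a little extra silence per segment for intact word boundaries, which is usually a good deal for the acoustic model.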

(Yv) #3

Which is rather curious, as the models are trained on <10 sec audio clips.

Have you done any investigation into where the errors on short segments come from? E.g.:

  • acoustic vs language model
  • can you see the same errors when segmenting the audio manually rather than using webrtcVAD?

(Dan) #4

By acoustic vs language model, do you mean trying transcription without the language model? If so, we did that accidentally in older versions of DeepSpeech Frontend, and the results were poor. This was prior to adding VAD, though.

Manually breaking apart audio doesn’t create these incorrect transcriptions. My working theory (corroborated by others on the #machinelearning IRC channel) is that WebRTC VAD is a touch aggressive in slicing up audio, so the beginning and end of a word can sometimes get partially chopped off, resulting in DeepSpeech not understanding the malformed audio it’s given.

(Yv) #5

Yeah, one way to check the results of the acoustic model in greater detail is to look at the output of the softmax layer and see which characters were guessed as most probable.

E.g., is the expected character among the 5 most probable characters guessed by the model?

If all the expected characters are there after the acoustic phase, then the language model part is the culprit.
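The check above can be sketched in a few lines. This assumes you already have the per-timestep softmax probabilities as a matrix (timesteps × alphabet) and a rough alignment of expected characters to timesteps; the function names and the simplified alphabet are illustrative, not part of the DeepSpeech API.

```python
# Sketch: given per-timestep character probabilities from the acoustic model
# (rows of shape len(alphabet)), check whether each expected character
# appears among the model's top-k guesses at its (assumed) timestep.

def topk_chars(probs_row, alphabet, k=5):
    """Return the k most probable characters for one timestep."""
    ranked = sorted(range(len(probs_row)), key=lambda i: probs_row[i], reverse=True)
    return [alphabet[i] for i in ranked[:k]]

def expected_in_topk(probs, alphabet, expected, k=5):
    """For each (timestep, char) pair, report whether the character was
    among the model's top-k guesses at that timestep."""
    return {(t, c): c in topk_chars(probs[t], alphabet, k) for t, c in expected}
```

If every expected character shows up in the top-5 here but the final transcript is still wrong, the language model rescoring is the likely culprit, as described above.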

Even with VAD cutting off the first/last word, the middle part should still be reasonably transcribed, unless the bad start/end of the word sequence throws the language model off completely.

(Dan) #6

Any pointers on how to look at the output of the softmax layer? Haven’t had much time to dig into this; I’ve been sidetracked doing optimization work on transcoding WAVs to Opus.

(Yv) #7

I’m using a script inspired by the inference part of the DeepSpeech Python script for that. It’s also described here:
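Once you have the raw per-timestep outputs, you can also decode them without the language model at all, which is another way to separate acoustic errors from language model errors. Below is a minimal greedy CTC decoder sketch, assuming each row of `logits` holds scores over the alphabet plus a trailing blank symbol; the real DeepSpeech client instead runs a beam search with a KenLM scorer, so this is an approximation for debugging only.

```python
# Greedy CTC decoding sketch: argmax character per timestep, collapse
# consecutive repeats, then drop blanks. No language model involved.

def ctc_greedy_decode(logits, alphabet, blank_index):
    """Return the acoustic-model-only transcript for a logits matrix."""
    best = [max(range(len(row)), key=row.__getitem__) for row in logits]
    out, prev = [], None
    for i in best:
        # Emit a character only when it differs from the previous timestep
        # (CTC repeat collapsing) and is not the blank symbol.
        if i != prev and i != blank_index:
            out.append(alphabet[i])
        prev = i
    return "".join(out)
```

Comparing this output against the full decode with the language model enabled makes it fairly obvious which stage is introducing the errors.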

(Lissyx) #8

Working on the TF Lite version integrated into another app through the mozillaspeechlibrary code, I can confirm one needs to be careful around WebRTC VAD, but I could get nice results on a Pixel 2, even with some background noise, and in French.

(Misho Kostira) #9

Good stuff, thanks for the insights!