DeepSpeech on RPi 4 - integration into speech_recognition - bad/slow detection

Dear all,

For our privacy-aware smart speaker I have tried to integrate DeepSpeech into the Python speech_recognition module. Unfortunately it turns out that while it does recognize something, most words are recognized incorrectly, or often only very few words of a sentence come through. On top of that, it is very slow.

This contradicts several reports I heard that it runs very well on RPi4.

The integration into SpeechRecognition can be seen here; it is only one function:

What it basically does is:

import numpy as np
from deepspeech import Model

# Paths truncated here; they point at the 0.6 release artifacts
language_model_file = ".../lm.binary"
trie_file = ".../trie"
prot_buffer_file = ".../output_graph.tflite"
beam_width = 500
lm_alpha = 0.75  # language model weight
lm_beta = 1.85   # word insertion bonus
ds = Model(prot_buffer_file, beam_width)
desired_sample_rate = ds.sampleRate()
ds.enableDecoderWithLM(language_model_file, trie_file, lm_alpha, lm_beta)
# Convert the captured audio to the model's sample rate, 16-bit samples
raw_data = audio_data.get_raw_data(convert_rate=desired_sample_rate, convert_width=2)
recognized_metadata = ds.sttWithMetadata(np.frombuffer(raw_data, np.int16))

The rest is just finding the files and checking for parameters etc etc.
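For completeness: sttWithMetadata returns a Metadata object rather than a plain string, and in the 0.6.x API the transcript can be reassembled from its per-character items. A minimal sketch, where the SimpleNamespace objects are stand-ins for DeepSpeech's real Metadata/MetadataItem types:

```python
from types import SimpleNamespace

def metadata_to_text(metadata):
    """Join the per-character items of a DeepSpeech 0.6 Metadata
    object back into a transcript string.  Assumes each item has a
    `.character` attribute, as in the 0.6.x API."""
    return "".join(item.character for item in metadata.items)

# Stand-in for a real Metadata result, for illustration only:
fake = SimpleNamespace(items=[SimpleNamespace(character=c) for c in "hello"])
print(metadata_to_text(fake))  # hello
```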

Is there anything that sticks out as completely wrong?

Thanks for any comments


It looks like this is 0.6.0. We have a bug in the 0.6.0 TFLite model that makes it mostly non-functional. Please try again with 0.6.1.

Quickly looking at the code, I see a lot of audio resampling and audio manipulation. Those can add artifacts that degrade the quality of the recognition with the current model.

Please give context on the reproducibility of the speech accuracy problem: it can also depend on the speaker’s accent, background noise, etc.

This is not really helpful. Please give more detailed information.

That code is not using the streaming API at all, so it doesn’t surprise me that it’s slow. Does Uberi/speech_recognition not support streaming? Also, could you explain why you couldn’t use our Python package directly?

Hi @lissyx, hi @reuben
thanks for your comments.

First of all, yes, that was 0.6.0; I will retry with 0.6.1. Thanks for the info.

Concerning the audio resampling: I don’t think there is a lot, just one step to get the audio into the rate required by DeepSpeech. @lissyx, do you have any other concerns? I don’t think that one resampling step is harmful, and it is necessary.

Concerning speech accuracy: I am not a native speaker, but I am sufficiently fluent, and most STT systems recognize my pronunciation correctly. Background noise was not a problem, as I was testing at home in silent surroundings.

Concerning speed: Sorry, I cannot give exact numbers; it felt like after the “talking part” it took about the same amount of time again to recognize. This might be related to the question about the “streaming API”, see below.

@reuben Yes, it is not using the streaming API: first, because I am not the author of SpeechRecognition, and secondly, because I don’t know how to integrate it into SpeechRecognition. You ask why we don’t use the package directly: because we allow the user of our smart speaker/personal assistant to select the STT system, be it online Google or Sphinx. I want to add DeepSpeech, but it needs to integrate into the general environment, so I am trying to get support for DeepSpeech into SpeechRecognition.

Again thanks everyone for the comments

It all depends on how resampling is done, and how it impacts the audio. I’m just listing what might impact.
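To illustrate the point about how resampling is done: a resampler that simply interpolates (or worse, drops samples) can alias high frequencies into the speech band, whereas a proper resampler low-pass filters first. A minimal linear-interpolation sketch (this is not what speech_recognition does internally; it is just to show the operation):

```python
import numpy as np

def resample_linear(samples, src_rate, dst_rate):
    """Linear-interpolation resampler, e.g. 44100 Hz -> 16000 Hz.
    Illustrative only: production resamplers apply an anti-aliasing
    low-pass filter first, which this sketch deliberately omits."""
    duration = len(samples) / src_rate
    n_out = int(round(duration * dst_rate))
    t_src = np.arange(len(samples)) / src_rate
    t_dst = np.arange(n_out) / dst_rate
    return np.interp(t_dst, t_src, samples).astype(samples.dtype)

# One second of a 440 Hz tone at 44.1 kHz, downsampled to the 16 kHz
# the DeepSpeech model expects:
tone = (np.sin(2 * np.pi * 440 * np.arange(44100) / 44100) * 10000).astype(np.int16)
out = resample_linear(tone, 44100, 16000)
print(len(out))  # 16000
```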

Well, sorry, but the current DeepSpeech dataset is very biased towards American English, so it can have a big impact.

What would that look like, concretely?

I don’t see any question regarding that so far. The streaming API is documented, with examples, in several places. Questions are welcome, but we don’t know speech_recognition, so we can’t advise unless you ask more precisely.
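For reference, the 0.6.x Python streaming calls are roughly createStream / feedAudioContent / finishStream on the model object. A sketch of the shape of the loop, with a stub standing in for deepspeech.Model so the structure is visible without the library:

```python
import numpy as np

def stream_transcribe(model, chunks):
    """Feed 16-bit PCM chunks through DeepSpeech's streaming API as it
    looked in 0.6.x (createStream / feedAudioContent / finishStream).
    Latency improves because decoding proceeds while audio arrives."""
    ctx = model.createStream()
    for chunk in chunks:
        model.feedAudioContent(ctx, chunk)
    return model.finishStream(ctx)

# Stub standing in for deepspeech.Model, for illustration only:
class FakeModel:
    def createStream(self):
        return []
    def feedAudioContent(self, ctx, chunk):
        ctx.append(len(chunk))
    def finishStream(self, ctx):
        return "fed %d chunks" % len(ctx)

chunks = [np.zeros(320, np.int16) for _ in range(5)]
print(stream_transcribe(FakeModel(), chunks))  # fed 5 chunks
```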

Hi @lissyx
sorry for the late reply; the real world intervened, as usual.

I have now tried with deepspeech 0.6.1, and I still get bad results. Maybe it is my pronunciation. A list of things I believe I said and what was detected:

  • “what is the time” – “well i am”
  • “tell me a joke” – " panado"
  • “who is David Bowie” – “who is the foe”

So as you see, there is quite a difference between what I expect and what comes out.

I also tried the deepspeech command directly, but I’m not sure how to record appropriately, that is, the correct arecord invocation. I tried several, most of them ending with cannot fit 'int' into an index-sized integer errors from the deepspeech command.

I guess arecord --channels=1 --format=S16_LE should do it
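Building on that guess, a full invocation would also pin the sample rate to the model’s 16 kHz and write a WAV file; the exact device setup is an assumption for a typical ALSA configuration:

```shell
# Record 5 seconds of 16 kHz mono 16-bit audio (matching the model's
# expected input format), then run it through the deepspeech 0.6 CLI.
arecord --rate=16000 --channels=1 --format=S16_LE --duration=5 test.wav
deepspeech --model output_graph.tflite --lm lm.binary --trie trie --audio test.wav
```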

Please make sure you speak loudly enough and there is no noise. Please ensure you are not dropping frames (use two separate threads, one for audio and one for deepspeech) if you use the streaming API.
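The two-thread advice can be sketched with a queue decoupling capture from decoding, so the capture thread never blocks on the (slower) recognizer. The capture and decode bodies below are placeholders for real mic reads and feedAudioContent calls:

```python
import queue
import threading

def run_pipeline(chunks):
    """Decouple audio capture from recognition with a queue, the
    "two separate threads" setup suggested above, with stand-in work."""
    q = queue.Queue()
    decoded = []

    def capture():
        for chunk in chunks:       # stands in for reading the microphone
            q.put(chunk)
        q.put(None)                # sentinel: end of audio

    def decode():
        while True:
            chunk = q.get()
            if chunk is None:
                break
            decoded.append(chunk)  # stands in for feeding the recognizer

    t1 = threading.Thread(target=capture)
    t2 = threading.Thread(target=decode)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return decoded

print(len(run_pipeline([b"\x00" * 640] * 10)))  # 10
```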

Make sure you are enabling language model, and with proper parameters.

Please also triple-check that you are using the 0.6.1 TFLite model file; we had to re-export it to fix a bug, and poor inference was a symptom.

As for your accent, I can’t really tell you more than: yes, the English model is biased towards American accents.