DeepSpeech on RPi 4 - integration into speech_recognition - bad/slow detection

Dear all,

for our privacy-aware smart speaker I have tried to integrate DeepSpeech into the Python speech_recognition module. Unfortunately, it turns out that while it does recognize a bit, most words are recognized incorrectly, or often only very few words of a sentence are picked up at all. On top of that, it is very slow.

This contradicts several reports I have heard that it runs very well on the RPi 4.

The integration into SpeechRecognition can be seen here; it is only one function: https://github.com/fossasia/speech_recognition/blob/e452e9f3295232a6f5de0dc789acf2e1a4311f5c/speech_recognition/init.py#L846

What it basically does is

import numpy as np
from deepspeech import Model

language_model_file = ".../lm.binary"
trie_file = ".../trie"
prot_buffer_file = ".../output_graph.tflite"
beam_width = 500
lm_alpha = 0.75
lm_beta = 1.85

# Load the TFLite model and attach the external language model + trie
ds = Model(prot_buffer_file, beam_width)
desired_sample_rate = ds.sampleRate()
ds.enableDecoderWithLM(language_model_file, trie_file, lm_alpha, lm_beta)

# Resample the captured audio to the model's rate, then run one-shot inference
raw_data = audio_data.get_raw_data(convert_rate=desired_sample_rate, convert_width=2)
recognized_metadata = ds.sttWithMetadata(np.frombuffer(raw_data, np.int16))

The rest is just finding the files, checking parameters, etc.

Is there anything that sticks out as completely wrong?

Thanks for any comments

Norbert

It looks like this is 0.6.0. We have a bug in the 0.6.0 TFLite model that makes it mostly non-functional. Please try again with 0.6.1.

Quickly looking at the code, I see a lot of audio resampling and audio manipulation. Those can add artifacts that degrade the quality of the recognition with the current model.
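(Illustration, not from the original thread.) One way such artifacts arise: a naive linear-interpolation resample, like the sketch below, applies no anti-aliasing filter, so downsampling can fold high-frequency content into the speech band. A proper resampler (polyphase or windowed-sinc) avoids this. The function name is hypothetical:

```python
import numpy as np

def naive_resample(samples: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    """Linear-interpolation resampling. Illustrative only: no anti-alias
    filtering, so downsampling can introduce aliasing artifacts."""
    duration = len(samples) / src_rate
    n_out = int(round(duration * dst_rate))
    src_t = np.arange(len(samples)) / src_rate
    dst_t = np.arange(n_out) / dst_rate
    return np.interp(dst_t, src_t, samples.astype(np.float64)).astype(np.int16)

# e.g. taking one second of 44.1 kHz capture down to the 16 kHz model rate
mic = (1000 * np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)).astype(np.int16)
resampled = naive_resample(mic, 44100, 16000)
```

For real use, a band-limited resampler (e.g. from a DSP library) is the safer choice.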

Please give context on the reproducibility of the speech-accuracy issue: it can also depend on the speaker’s accent, background noise, etc.

This is not really helpful. Please give more detailed information.

That code is not using the streaming API at all, so it doesn’t surprise me that it’s slow. Does Uberi/speech_recognition not support streaming? Also, could you explain why you couldn’t use our Python package directly?
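(Illustration, not from the original thread.) A sketch of what feeding audio incrementally might look like with the 0.6-era Python API (createStream / feedAudioContent / finishStream; check the release docs for the exact signatures). Only the chunking helper below is runnable as-is; the streaming loop is shown in comments because it needs the deepspeech package and a model:

```python
import numpy as np

CHUNK_SAMPLES = 320  # 20 ms at 16 kHz

def chunks(audio: np.ndarray, size: int = CHUNK_SAMPLES):
    """Yield fixed-size frames from a 16-bit mono buffer (last frame may be shorter)."""
    for start in range(0, len(audio), size):
        yield audio[start:start + size]

# Hypothetical streaming loop (assumes `ds` is a loaded deepspeech Model):
#
#   stream = ds.createStream()
#   for frame in chunks(np.frombuffer(raw_data, np.int16)):
#       ds.feedAudioContent(stream, frame)   # decode while audio arrives
#   text = ds.finishStream(stream)           # final transcript, little extra latency
```

The point is that decoding overlaps with capture, so most of the inference cost is paid while the user is still speaking.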

Hi @lissyx, hi @reuben
thanks for your comments.

First of all, yes, that was 0.6.0; I will retry with 0.6.1. Thanks for the info.

Concerning the audio resampling: I don’t think there is a lot, just one conversion to the sample rate required by DeepSpeech. @lissyx, do you have any other complaints? I don’t think that one resampling step is bad; it is necessary.

Concerning speech accuracy: yes, I am not a native speaker, but I am sufficiently fluent, and most STT systems recognize my pronunciation correctly. Background noise was not a problem, as I was testing at home in silent surroundings.

Concerning speed: sorry, I cannot give details; it felt like after the “talking part” it took about the same amount of time again to recognize. This might be related to the question about the streaming API, see below.
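(Illustration, not from the original thread.) “Takes about as long as the speech itself” corresponds to a real-time factor (RTF) near 1; streaming hides most of that by decoding while audio arrives. A minimal way to quantify the feeling, where `run_inference` is a hypothetical stand-in for the actual `ds.stt(...)` call:

```python
import time

def real_time_factor(run_inference, audio_seconds: float) -> float:
    """RTF = wall-clock inference time / audio duration.
    RTF < 1 means recognition is faster than real time."""
    start = time.perf_counter()
    run_inference()
    return (time.perf_counter() - start) / audio_seconds

# Example with a stand-in workload instead of ds.stt(...):
rtf = real_time_factor(lambda: time.sleep(0.05), audio_seconds=1.0)
```

Reporting a measured RTF (and the model variant) would make "it is slow" concrete for the developers.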

@reuben Yes, it is not using the streaming API, first because I am not the author of SpeechRecognition, and secondly because I don’t know how to integrate it into SpeechRecognition. You ask why we don’t use the package directly: because we allow the user of our smart speaker/personal assistant to select the STT system, be it online Google or Sphinx. I want to add DeepSpeech, but it needs to integrate into the general environment, so I am trying to get support for DeepSpeech into SpeechRecognition.

Again thanks everyone for the comments

It all depends on how the resampling is done and how it impacts the audio. I’m just listing what might have an impact.

Well, sorry, but the current DeepSpeech dataset is very biased towards American English, so it can have a big impact.

I don’t see any question regarding that so far. The streaming API is documented, with examples, in several places. Questions are welcome, but we don’t know speech_recognition, so we can’t advise unless you ask more precisely.