DeepSpeech on RPi 4 - integration into speech_recognition - bad/slow detection

norbert · January 14, 2020, 7:33am

Dear all,

for our privacy-aware smart speaker I have tried to integrate DeepSpeech into the Python speech_recognition module. Unfortunately, while it does recognize something, most words are recognized wrongly, and often only a few words of a sentence come through at all. On top of that, it is very slow.

This contradicts several reports I have heard that it runs very well on the RPi 4.

The integration into SpeechRecognition can be seen here; it is only one function: https://github.com/fossasia/speech_recognition/blob/e452e9f3295232a6f5de0dc789acf2e1a4311f5c/speech_recognition/__init__.py#L846

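What it basically does is:

# Imports needed by this excerpt (DeepSpeech 0.6 Python package)
import numpy as np
from deepspeech import Model
# Paths are elided here; the real function locates these files
language_model_file = ".../lm.binary"
trie_file = ".../trie"
prot_buffer_file = ".../output_graph.tflite"
# Decoder settings
beam_width = 500
lm_alpha = 0.75
lm_beta = 1.85
# Load the TFLite acoustic model, then attach the language model
ds = Model(prot_buffer_file, beam_width)
desired_sample_rate = ds.sampleRate()
ds.enableDecoderWithLM(language_model_file, trie_file, lm_alpha, lm_beta)
# audio_data is the speech_recognition AudioData passed into the function;
# convert it to the model's sample rate as 16-bit samples
raw_data = audio_data.get_raw_data(convert_rate=desired_sample_rate, convert_width=2)
recognized_metadata = ds.sttWithMetadata(np.frombuffer(raw_data, np.int16))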

The rest is just finding the files, checking parameters, and so on.

Is there anything that sticks out as completely wrong?

Thanks for any comments

Norbert

lissyx · January 14, 2020, 8:31am

It looks like this is 0.6.0. There is a bug in the 0.6.0 TFLite model that leaves it mostly not working. Please try again with 0.6.1.

Quickly looking at the code, I see a lot of audio resampling and audio manipulation. Those can add artifacts that degrade the quality of recognition with the current model.

Please give context on the reproducibility of speech accuracy: it can also depend on the speaker’s accent, background noise, etc.

Regarding speed: “it is very slow” is not really helpful. Please give more detailed information.
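One way to put a number on "slow" is the real-time factor: compute time divided by the duration of the audio. The sketch below is only an illustration; `real_time_factor` and the `transcribe` callable are hypothetical names, and the audio is assumed to be mono 16-bit PCM.

```python
import time


def real_time_factor(transcribe, audio_bytes, sample_rate, sample_width=2):
    """Time one transcription call and relate it to the audio length.

    `transcribe` is any callable taking raw PCM bytes (hypothetical; wrap
    whatever STT call is being measured). Audio is assumed to be mono.
    Returns (text, audio_seconds, compute_seconds, rtf).
    """
    if not audio_bytes:
        raise ValueError("no audio to measure")
    # Duration of the recording, derived from byte count and format.
    audio_seconds = len(audio_bytes) / (sample_rate * sample_width)
    start = time.perf_counter()
    text = transcribe(audio_bytes)
    compute_seconds = time.perf_counter() - start
    # rtf < 1 means recognition ran faster than the audio's own duration.
    return text, audio_seconds, compute_seconds, compute_seconds / audio_seconds
```

Reporting the audio length, the compute time, and the resulting factor for a few utterances gives much more to work with than an impression of slowness.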

reuben · January 14, 2020, 9:21am

That code is not using the streaming API at all, so it doesn’t surprise me that it’s slow. Does Uberi/speech_recognition not support streaming? Also, could you explain why you couldn’t use our Python package directly?
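For comparison, a minimal sketch of what streaming looks like: audio is fed chunk by chunk while the speaker is still talking, so most of the work is done by the time the utterance ends. The method names `createStream`, `feedAudioContent` and `finishStream` are meant to follow the DeepSpeech 0.6 Python API, but treat them as an assumption and check them against the installed version. `transcribe_stream` and `FakeModel` are hypothetical names; the stand-in only lets the sketch run without DeepSpeech installed.

```python
def transcribe_stream(model, chunks):
    """Feed audio chunks to the model as they arrive; return the final text.

    With a real DeepSpeech 0.6 model each chunk would be a NumPy int16
    array (assumption; see the lead-in).
    """
    stream = model.createStream()
    for chunk in chunks:
        # Inference advances incrementally with every chunk fed in.
        model.feedAudioContent(stream, chunk)
    # Only the remaining tail of the audio is decoded here.
    return model.finishStream(stream)


class FakeModel:
    """Hypothetical stand-in that mimics the three calls used above."""

    def createStream(self):
        return []

    def feedAudioContent(self, stream, chunk):
        stream.append(chunk)

    def finishStream(self, stream):
        return "received %d chunks" % len(stream)


print(transcribe_stream(FakeModel(), [b"\x00\x00"] * 3))
```

The one-shot call in the original function only starts recognition after the whole recording is available, which would fit the impression that recognition takes about as long again as the speaking did.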

norbert · January 14, 2020, 2:28pm

Hi @lissyx, hi @reuben
thanks for your comments.

First of all, yes, that was 0.6.0, I will retry with 0.6.1. Thanks for the info.

Concerning the audio resampling: I don’t think there is a lot, just one step to get the audio into the rate required by DeepSpeech. @lissyx do you have any other complaints? I don’t think that a single resampling is bad; rather, it is necessary.

Concerning speech accuracy: Yes, I am not native, but sufficiently fluent and most STT systems recognize my pronunciation correctly. Background noise was not a problem, as I was testing at home, in silent surroundings.

Concerning speed: Sorry, I cannot give details. It felt like, after the “talking part”, recognition took about the same amount of time again. This might be related to the question concerning the “streaming API”, see below.

@reuben Yes, it is not using the streaming API, first because I am not the author of SpeechRecognition, and second because I don’t know how to integrate it into SpeechRecognition. You ask why we don’t use the package directly: because we allow the user of our smart speaker/personal assistant to select the STT system, be it online Google or Sphinx. I want to add DeepSpeech, but it needs to integrate into the general environment, so I am trying to get support for DeepSpeech into SpeechRecognition.

Again thanks everyone for the comments

lissyx · January 14, 2020, 2:31pm

It all depends on how resampling is done, and how it impacts the audio. I’m just listing what might have an impact.

Well, sorry, but the current DeepSpeech dataset is very biased towards American English, so accent can have a big impact.

That would look like it.

I don’t see any question regarding that so far. The streaming API is documented, with examples in several places. Questions are welcome, but we don’t know speech_recognition, so we can’t advise unless you ask more precisely.

norbert · January 24, 2020, 3:26pm

Hi @lissyx
sorry for the late reply, real life intervened, as usual.

I have now tried with deepspeech 0.6.1, and I still get bad results. Maybe it is my pronunciation. A list of things I believe I said and what was detected:

“what is the time” – “well i am”
“tell me a joke” – “panado”
“who is David Bowie” – “who is the foe”

So as you see, there is quite a difference between what I expect and what comes out.

I also tried the deepspeech command directly, but I’m not sure how to record appropriately, that is, what the correct arecord invocation is. I tried several, most of them ending with “cannot fit 'int' into an index-sized integer” errors from the deepspeech command.

lissyx · January 24, 2020, 3:36pm

I guess arecord --channels=1 --format=S16_LE should do it
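Before handing a recording to the deepspeech command, its format can be checked with Python's standard `wave` module. This is a sketch, not part of DeepSpeech: `check_wav` is a hypothetical helper, and the 16000 Hz default is an assumption about the released English model (the code earlier in the thread asks the model via `ds.sampleRate()` instead). Note that the suggested arecord flags do not set a sample rate, so the rate is worth checking too.

```python
import wave


def check_wav(path, expected_rate=16000):
    """Return a list of reasons the file is not mono, 16-bit, expected rate.

    An empty list means the header looks suitable. expected_rate is an
    assumption; use the rate the model actually reports.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append("expected 1 channel, found %d" % wav.getnchannels())
        if wav.getsampwidth() != 2:
            problems.append("expected 16-bit samples, found %d-bit"
                            % (8 * wav.getsampwidth()))
        if wav.getframerate() != expected_rate:
            problems.append("expected %d Hz, found %d Hz"
                            % (expected_rate, wav.getframerate()))
    return problems
```

Calling `check_wav("test.wav")` on a recording (the file name is only an example) and printing the result shows at a glance whether the audio needs to be re-recorded or converted.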

Please make sure you speak loud enough and there is no noise. Please ensure you are not dropping frames (use two separate threads for audio and for deepspeech) if you use the Streaming API.
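The two-thread arrangement can be sketched with a queue between a capture thread and a recognition thread, so that a slow recognition step never blocks the microphone read. Everything here is illustrative: `run_pipeline`, `read_chunk` and `feed_chunk` are hypothetical names, and the real callables would wrap the audio source and the streaming feed call.

```python
import queue
import threading


def run_pipeline(read_chunk, feed_chunk, num_chunks):
    """Capture audio in one thread and feed the recogniser in another.

    read_chunk() returns the next block of audio; feed_chunk(block)
    passes it to the recogniser. Both are hypothetical callables.
    """
    chunks = queue.Queue()  # unbounded, so capture never waits on recognition

    def capture():
        for _ in range(num_chunks):
            chunks.put(read_chunk())
        chunks.put(None)  # sentinel: no more audio

    def recognise():
        while True:
            chunk = chunks.get()
            if chunk is None:
                break
            feed_chunk(chunk)

    threads = [threading.Thread(target=capture),
               threading.Thread(target=recognise)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

The queue preserves order, so the recogniser sees the chunks exactly as they were captured even when it briefly falls behind.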

Make sure you are enabling the language model, with proper parameters.

Please also triple-check that you are using the 0.6.1 TFLite model file; we had to re-export it to fix a bug, and poor inference was a symptom.

As for your accent, I can’t really say more than that yes, the English model is biased towards American accents.
