DeepSpeech Problems with Speech Recognition Using Microphone

Hello all,
I am using DeepSpeech 0.9.3 with tflite on a Raspberry Pi 4 B. The installation went flawlessly; however, I now have the following problem:
When transcribing WAV files, the speech-to-text works great; when using the microphone, hardly any word is recognized correctly.
It is not due to the microphone: I have already tried different ones and also recorded the played WAV file via the microphone, so it doesn’t seem to be the hardware. Does anyone have any idea what the problem is?

Many thanks in advance.

Can you make a recording of the mic signal, e.g. with arecord, and play it back? How does it sound: maybe distorted and/or noisy?
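
If you would rather check a capture programmatically than only by ear, something like this rough sketch works (the file name is just a placeholder, and peak/RMS numbers are no substitute for actually listening):

import audioop
import wave

# Inspect a recorded test file: sample rate, channels, and signal level
with wave.open("mic_test.wav", "rb") as wf:
    data = wf.readframes(wf.getnframes())
    width = wf.getsampwidth()
    print("rate:", wf.getframerate(), "channels:", wf.getnchannels())
    print("peak:", audioop.max(data, width), "rms:", audioop.rms(data, width))
# A peak stuck near 32767 on 16-bit audio usually means the signal is clipping.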

What I did so far: I recorded an audio clip on my cellphone and converted it to a WAV file; the result was almost perfect. Then I activated the microphone and played that audio; the result was really bad. I also recorded (arecord) the audio from my cellphone with the microphone and saved that as a WAV file; the result was, as above, almost perfect. So it seems that something goes wrong between activating the microphone and the direct “translation”.

Just a random thought, but does DeepSpeech know the sample rate? WAV files specify their sample rate, so when reading from a file, DeepSpeech can read what the sample rate is. If the audio is not coming from a file, maybe it is guessing, and guessing wrong. But I could be wrong. I haven’t used DeepSpeech myself on anything other than WAV files, and I haven’t even gotten it working for languages other than English.
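
If it helps, here is a rough sketch of that check on the file side (the file name is a placeholder, and I’m assuming the 0.9.x Python API where Model.sampleRate() reports the rate the model expects):

import wave
from deepspeech import Model

model = Model("deepspeech-0.9.3-models.tflite")
print("model expects:", model.sampleRate())   # 16000 for the released models

with wave.open("audio/DeepSpeechTest44100khz.wav", "rb") as wf:
    print("file rate:", wf.getframerate(), "channels:", wf.getnchannels())

When audio comes from a microphone stream instead of a file, there is no header to read, so whatever rate the capture code opens the device with has to match what the model expects.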

Last time I checked, sox is used internally to make sure that the input is 16 kHz WAV, so you don’t have to check that. If you can transcribe WAVs fine, it is not DeepSpeech but the way you feed it audio. Have you checked the examples?
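
If you want to rule out that internal resampling entirely, converting the file up front looks roughly like this (a sketch calling sox from Python; the paths are placeholders and this is not copied from the DeepSpeech client):

import subprocess

# Convert to 16 kHz, mono, 16-bit PCM before handing the file to deepspeech
subprocess.run(
    ["sox", "input.wav", "-r", "16000", "-c", "1", "-b", "16", "input_16k.wav"],
    check=True,
)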

I’m using sox and I also tried different sample rates from 8000 Hz to 128000 Hz. The results were different, but not better or worse. I also checked the examples; they worked fine, like all the WAV files did.

Again, it is working on Raspberry Pi 4s. So your problem has to do with the way you get the audio from the mic to DeepSpeech.

Unless you post the code, it will be hard for us to give you further advice. You didn’t give any indication of what you are doing, and it looks like you didn’t try any of the examples. Sorry, can’t help you.

I’m sorry if I didn’t get your answer right; I’m pretty new to all of this stuff. What code do you mean? The code I run for the speech-to-text command?

Pretty simple: DeepSpeech works with WAV, but not with the microphone. So the problem is between the mic input and the DeepSpeech command. What is the code for that? Why didn’t you try one of the examples?

Thanks in advance for your help. Here is the code:

Running an example:

pi@raspberrypi:~/DeepSpeech $ deepspeech --model deepspeech-0.9.3-models.tflite --scorer deepspeech-0.9.3-models.scorer --audio audio/2830-3980-0043.wav
Loading model from file deepspeech-0.9.3-models.tflite
TensorFlow: v2.3.0-6-g23ad988
DeepSpeech: v0.9.3-0-gf2e9c85
Loaded model in 0.0578s.
Loading scorer from files deepspeech-0.9.3-models.scorer
Loaded scorer in 0.0177s.
Running inference.
experience proves this
Inference took 7.162s for 1.975s audio file.

Running a WAV file that was recorded on the cellphone and then converted to WAV. This worked pretty well:

pi@raspberrypi:~/DeepSpeech $ deepspeech --model deepspeech-0.9.3-models.tflite --scorer deepspeech-0.9.3-models.scorer --audio audio/DeepSpeechTest44100khz.wav
Loading model from file deepspeech-0.9.3-models.tflite
TensorFlow: v2.3.0-6-g23ad988
DeepSpeech: v0.9.3-0-gf2e9c85
Loaded model in 0.00259s.
Loading scorer from files deepspeech-0.9.3-models.scorer
Loaded scorer in 0.000485s.
Warning: original sample rate (44100) is different than 16000hz. Resampling might produce erratic speech recognition.
Running inference.
one two three four five six seven eight nine ten this is a test pick up boxes collected boxes get me boxes one above is going to be wonderful the weather is bad
Inference took 18.162s for 23.127s audio file.

Running a WAV file recorded on the Raspberry Pi (with the 4-mic ReSpeaker):

pi@raspberrypi:~/DeepSpeech $ deepspeech --model deepspeech-0.9.3-models.tflite --scorer deepspeech-0.9.3-models.scorer --audio audio/DeepSpeechTestArecord64000khz.wav
Loading model from file deepspeech-0.9.3-models.tflite
TensorFlow: v2.3.0-6-g23ad988
DeepSpeech: v0.9.3-0-gf2e9c85
Loaded model in 0.00286s.
Loading scorer from files deepspeech-0.9.3-models.scorer
Loaded scorer in 0.000537s.
Warning: original sample rate (64000) is different than 16000hz. Resampling might produce erratic speech recognition.
Running inference.
> one two three four five six seven eight nine ten is is in test pickaxes cornet oceanos your botanising be wonderful to wiesbaden
Inference took 21.370s for 25.000s audio file.

Using the microphone (4-mic ReSpeaker). I played the audio file from the cellphone so that there would be no difference in emphasis and so on:

pi@raspberrypi:~/DeepSpeech/DeepSpeech-examples/mic_vad_streaming $ python3 mic_vad_streaming.py -m deepspeech-0.9.3-models.tflite -s deepspeech-0.9.3-models.scorer
Initializing model…
INFO:root:ARGS.model: deepspeech-0.9.3-models.tflite
TensorFlow: v2.3.0-6-g23ad988
DeepSpeech: v0.9.3-0-gf2e9c85
INFO:root:ARGS.scorer: deepspeech-0.9.3-models.scorer
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.front
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround21
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround21
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround40
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround41
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround50
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround51
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround71
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.iec958
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.iec958
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.iec958
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.hdmi
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.hdmi
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.modem
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.modem
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.phoneline
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.phoneline
Cannot connect to server socket err = No such file or directory
Cannot connect to server request channel
jack server is not running or cannot be started
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
Cannot connect to server socket err = No such file or directory
Cannot connect to server request channel
jack server is not running or cannot be started
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
ALSA lib pcm_oss.c:377:(_snd_pcm_oss_open) Unknown field port
ALSA lib pcm_oss.c:377:(_snd_pcm_oss_open) Unknown field port
ALSA lib pcm_a52.c:823:(_snd_pcm_a52_open) a52 is only for playback
ALSA lib conf.c:5014:(snd_config_expand) Unknown parameters {AES0 0x6 AES1 0x82 AES2 0x0 AES3 0x2 CARD 0}
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM iec958:{AES0 0x6 AES1 0x82 AES2 0x0 AES3 0x2 CARD 0}
ALSA lib pcm_usb_stream.c:486:(_snd_pcm_usb_stream_open) Invalid type for card
ALSA lib pcm_usb_stream.c:486:(_snd_pcm_usb_stream_open) Invalid type for card
ALSA lib pcm_hw.c:1822:(_snd_pcm_hw_open) Invalid value for card
ALSA lib pcm_hw.c:1822:(_snd_pcm_hw_open) Invalid value for card
Cannot connect to server socket err = No such file or directory
Cannot connect to server request channel
jack server is not running or cannot be started
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
Listening (ctrl-C to exit)…
> Recognized: one two three four for
> Recognized: seven
> Recognized:
> Recognized: not
> Recognized:
> Recognized: this is a test
> Recognized: books
> Recognized: octopus
> Recognized: get your boxes
> Recognized:
> Recognized: professing to be wonderful to water is bad
^CTraceback (most recent call last):
  File "mic_vad_streaming.py", line 224, in <module>
    main(ARGS)
  File "mic_vad_streaming.py", line 182, in main
    for frame in frames:
  File "mic_vad_streaming.py", line 130, in vad_collector
    for frame in frames:
  File "mic_vad_streaming.py", line 114, in frame_generator
    yield self.read()
  File "mic_vad_streaming.py", line 82, in read
    return self.buffer_queue.get()
  File "/usr/lib/python3.7/queue.py", line 170, in get
    self.not_empty.wait()
  File "/usr/lib/python3.7/threading.py", line 296, in wait
    waiter.acquire()
KeyboardInterrupt

The original text of the recorded file is:

one two three four five six seven eight nine ten
this is a test

pick up boxes
get new boxes
collect boxes

the new apartment is gonna be wonderful
the weather is bad

I hope this is what you meant.

How did you record that? It looks like you are playing 44.1 kHz audio and recording it at 64 kHz, only to downsample it again to 16 kHz. Not a showstopper, but strange.

  1. All the error messages make me wonder, but I’m not an expert on those. @dkreutz, do you know whether that looks OK? Maybe some driver issue?

  2. You didn’t specify a sample rate? Did you read the docs? (See the sketch after this list for recording straight at 16 kHz.)

  3. This uses VAD and splits the audio, so different results are to be expected. Try a VAD aggressiveness of 0.
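
As an illustration of points 1 and 2, here is a minimal sketch (not the example script itself) of recording straight at 16 kHz mono from a chosen input device with PyAudio; the device index, duration, and output path are assumptions you would adapt after looking at the device listing it prints:

import wave
import pyaudio

RATE = 16000          # record directly at the rate the model expects
DEVICE_INDEX = 2      # assumption: pick the ReSpeaker's index from the listing below
SECONDS = 5

pa = pyaudio.PyAudio()

# List input-capable devices so you can pick the right index
for i in range(pa.get_device_count()):
    info = pa.get_device_info_by_index(i)
    if info.get("maxInputChannels", 0) > 0:
        print(i, info["name"], int(info["defaultSampleRate"]))

stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, input_device_index=DEVICE_INDEX,
                 frames_per_buffer=320)
frames = [stream.read(320) for _ in range(RATE // 320 * SECONDS)]
stream.stop_stream()
stream.close()

# Save the capture so you can listen to it and feed the very same file to the CLI
with wave.open("mic_16k.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(pa.get_sample_size(pyaudio.paInt16))
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))

pa.terminate()

If that capture sounds clean when you play it back and the deepspeech CLI transcribes it well, the streaming/VAD settings (point 3) are the next suspect; if it already sounds bad, the problem is upstream of DeepSpeech.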

Linux audio is a pain in the back. Coincidentally I own a ReSpeaker mic array v2; I will give it a try with the vad-streaming demo, but this might take some time (quite a tight schedule in real life nowadays…).

@jens: Which exact version of the ReSpeaker array do you have: the HAT for the RPi, v1, v2, or the USB one? Which firmware do you use: multi-channel or single-channel output?


Check the other answer.