Deepspeech recognition rate


#1

Hi all,

working with deepspeech we noticed that our overall recognition rate is not good. This doesn’t accord with what we were expecting, especially not after reading Baidu’s Deepspeech research paper.
We are using the cpu architecture and run deepspeech with the python client. (Switching to the gpu-implementation would only increase inference speed, not accuracy, right?)
To get a measurement of the accuracy we used a python implementation of the WER (word error rate) to analyse the results. The overall performance on 20 samples (male & female) is an error rate of 90%.
We used the pre-trained model as described in the README. The audio files we used were self-recorded 16kHz, mono, wav-format files with some background noise.
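
For anyone who wants to reproduce the measurement: the thread does not say which WER implementation was used, so the following is only a minimal sketch of the usual definition (word-level edit distance divided by the number of reference words). The function name and the lowercase/whitespace normalisation are assumptions, and punctuation is not stripped.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("from what series do you know rosie the robot",
                          "utewilknorosi"))
```

Note that WER can exceed 1.0 when the hypothesis contains many extra words, so an average over several files is not strictly a percentage of wrong words.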

🐳  marvin models # deepspeech output_graph.pb /path/to/file/05_m_rosie_robot.wav alphabet.txt lm.binary trie 
Loading model from file output_graph.pb
Loaded model in 0.514s.
Loading language model from files lm.binary trie
Loaded language model in 1.930s.
Running inference.
utewilknorosi 
Inference took 12.948s for 4.250s audio file.

We would like to provide the audio file as well (Text: “From what series do you know Rosie the robot?”) but .wav isn’t supported for upload. Therefore via wetransfer: https://we.tl/J4wpgkS94Y

Please advise on any further information you need to investigate and reproduce this issue.

We’re looking forward to hearing from y’all!


(Lissyx) #2

Hello,

90% seems clearly wrong. I’m not able to have a look at the WAV files right now, but off the top of my head I’d have two suspects:

  • recording itself,
  • and the background noise.

The background noise will obviously hurt recognition, since the current model has been trained on clean audio. Some people on the team are working on data augmentation with noise to improve the training, but that is not yet part of what we have.

The recording itself could also be a source of error: even if the sound seems to be clean enough, it might contain some lower or higher frequencies that are interacting badly. This is also something we are looking into, e.g., https://github.com/mozilla/DeepSpeech/issues/1259.

I clearly remember that when running some tests with others a few weeks ago, changing from one mic to another took us from a 60% error rate to 10%, all other parameters being equal.

Maybe others can weigh in here, but it might be useful to get more details on your recording setup. If you can, could you try recording with another microphone?


(Lissyx) #3

A similar experience, with feedback from another contributor (@alex.n.james): Performance for native English speaker with pre-trained model?


(Julien Tane) #4

Hello, I downloaded the wav file, but I found it nearly inaudible.
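
One way to put a number on "nearly inaudible" is to look at the peak level of the file. This is only a sketch using Python's standard `wave` module; the function name `peak_dbfs` is made up for illustration, and it assumes uncompressed 8-bit or 16-bit PCM.

```python
import math
import struct
import wave


def peak_dbfs(path: str) -> float:
    """Return the peak level of a PCM WAV file in dB relative to full scale."""
    with wave.open(path, "rb") as wav:
        width = wav.getsampwidth()
        frames = wav.readframes(wav.getnframes())
    if width == 1:
        # 8-bit WAV samples are unsigned and centred on 128.
        samples = [b - 128 for b in frames]
        full_scale = 128
    elif width == 2:
        # 16-bit WAV samples are signed little-endian integers.
        samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
        full_scale = 32768
    else:
        raise ValueError("unsupported sample width: %d bytes" % width)
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")  # completely silent file
    return 20 * math.log10(peak / full_scale)
```

The closer the result is to 0, the louder the loudest sample; a strongly negative value would be consistent with a very quiet recording. No threshold is given in this thread, so what counts as "too quiet" is left open.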


(Julien Tane) #5

Hello,

to clarify my last statement… I had to increase the volume to hear the recording.

I tried to look deeper into it:
$ mediainfo ~/Downloads/05_m_rosie_robot.wav
General
Complete name : /home/jta/Downloads/05_m_rosie_robot.wav
Format : Wave
File size : 66.4 KiB
Duration : 4s 250ms
Overall bit rate mode : Constant
Overall bit rate : 128 Kbps

Audio
Format : PCM
Format settings, Endianness : Little
Format settings, Sign : Unsigned
Codec ID : 1
Duration : 4s 250ms
Bit rate mode : Constant
Bit rate : 128 Kbps
Channel(s) : 1 channel
Sampling rate : 16.0 KHz
Bit depth : 8 bits
Stream size : 66.4 KiB (100%)

Bit depth is 8 bit

From the README:
Once everything is installed you can then use the deepspeech binary to do speech-to-text on short, approximately 5 second, audio files (currently only WAVE files with
16-bit, 16 kHz, mono are supported in the Python client):

so maybe this could play a role.

I also tried to have pocketsphinx interpret it, and it had problems as well because it also expects 16-bit audio.
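
The same check can be scripted before feeding files to the client. The sketch below uses Python's standard `wave` module to compare a file against the format quoted from the README (16-bit, 16 kHz, mono); the function name `check_wav_format` and its return shape are assumptions made for illustration.

```python
import wave


def check_wav_format(path, channels=1, bit_depth=16, rate=16000):
    """Return a list of mismatches between the file and the expected format."""
    with wave.open(path, "rb") as wav:
        actual = {
            "channels": wav.getnchannels(),
            "bit depth": 8 * wav.getsampwidth(),  # sample width is in bytes
            "sample rate": wav.getframerate(),
        }
    expected = {"channels": channels, "bit depth": bit_depth, "sample rate": rate}
    return [
        "%s is %s, expected %s" % (key, actual[key], expected[key])
        for key in expected
        if actual[key] != expected[key]
    ]
```

An empty list means the header matches the README requirements; it says nothing about the audio quality itself.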


#6

Hello everyone,

thank you all for your input. We will investigate your suggestions and report on the results.


#7

Thank you for that detailed answer, jtane. We tried another file that also had a bit depth of 8 bit and was recognised correctly (100% success rate), so bit depth does not appear to be the issue. Apparently this has already been fixed to work with 8-bit depth as well.
But thanks again for pointing that out, we clearly missed that information from the readme.


#8

Hi lissyx,

first off, thank you for the clarification. We have made some adjustments to the recording and are now up to a 58% correct rate. So this seems to be the fix that we’ll keep working on.


(Lissyx) #9

That’s interesting, can you share the adjustments you made? It might help us improve training and help others get better results 🙂


#10

Of course. I’ll post an update once we have found the best set-up together with the initial set-up.


#11

Hello everyone,

please excuse the downtime.
As promised, here comes an update of our approach:

Recording settings

At the beginning we recorded the samples with arecord, using the parameters we thought deepspeech needed to work with the files:

arecord -t wav -r 16000 -d 3 test_01.wav 

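(wav format, 16kHz, 3 seconds duration; no -f option was given, so arecord fell back to its default 8-bit unsigned format, which matches the mediainfo output earlier in this thread)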

As discussed, this resulted in a pretty poor sound quality. To fix that, we switched to the sample format dat which appeared to have the best quality:

arecord -f dat -d 3 test_01.wav

To convert it for deepspeech, sox was used:

sox input.wav -c 1 -r 16000 -b 16 output.wav

(desired output quality: 1 channel/mono, 16kHz rate, 16 bit depth)
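
With a test set of around 20 samples, the conversion can be run over a whole directory. The following is just a sketch that wraps the exact sox call above in Python; the function names and the directory layout are hypothetical, and sox is assumed to be installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def sox_command(src, dst):
    """Build the sox call from the post: mono, 16 kHz, 16-bit output."""
    return ["sox", str(src), "-c", "1", "-r", "16000", "-b", "16", str(dst)]


def convert_all(in_dir, out_dir):
    """Convert every .wav file in in_dir into a same-named file in out_dir."""
    if shutil.which("sox") is None:
        raise RuntimeError("sox not found on PATH")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).glob("*.wav")):
        # check=True raises CalledProcessError if sox reports a failure.
        subprocess.run(sox_command(src, out / src.name), check=True)
```

Writing to a separate output directory keeps the original 48 kHz stereo recordings around for comparison.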

With this adjustment our word error rate dropped from about 90% to 50%.

Bandpass Filter

@lissyx, you suggested the use of a bandpass filter and provided some links where this topic is discussed and also the procedure of applying it. In our situation, applying a bandpass filter only had the effect of making things worse, so we discarded it. Still thank you for pointing out that approach.

Recording hardware

Interestingly, using a Rode microphone (N3594) didn’t increase the sound quality noticeably in comparison to the built-in mic of our laptop.

Final thoughts

We hope that sharing our experience may help you to get better results with training. Maybe we can even prevent others from running into the issues we had. Also, if anything is unclear in this description, please feel free to ask.
Altogether we really enjoyed working with deepspeech, even though we did not manage to get close to an error rate of <10% as it is described here.

If anybody has suggestions on what else we could change to decrease our error rate, we’re very happy to hear from you.


(Lissyx) #12

Thanks for the detailed feedback. Re-checking the manpage of arecord: -f dat (16 bit little endian, 48000, stereo) [-f S16_LE -c2 -r48000]

So I’m wondering if you could give it a try, under the same conditions, with just: -f S16_BE -c1 -r16000 (I think we need big endian).


(Yv) #13

That’s interesting: mediainfo for the sample audio from the release, 4507-16021-0012.wav, shows little endian:
Format : Wave
File size : 85.5 KiB
Duration : 2 s 735 ms
Overall bit rate mode : Constant
Overall bit rate : 256 kb/s

Audio
Format : PCM
Format settings, Endianness : Little
Format settings, Sign : Signed
Codec ID : 1
Duration : 2 s 735 ms
Bit rate mode : Constant
Bit rate : 256 kb/s
Channel(s) : 1 channel
Sampling rate : 16.0 kHz
Bit depth : 16 bits
Stream size : 85.5 KiB (100%)


(Lissyx) #14

Well, you just proved I should have checked, we seem to need little-endian 🙂
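
For reference, the byte order of a WAVE file can also be read straight from its header: standard files start with the chunk ID `RIFF` (little-endian), while the rarely seen `RIFX` variant marks big-endian data. A minimal sketch, with a made-up function name:

```python
def wav_byte_order(path):
    """Return 'little' or 'big' based on the container's chunk ID."""
    with open(path, "rb") as f:
        header = f.read(12)
    # Bytes 8..11 hold the form type, which must be WAVE for audio files.
    if len(header) < 12 or header[8:12] != b"WAVE":
        raise ValueError("not a WAVE file")
    if header[:4] == b"RIFF":
        return "little"
    if header[:4] == b"RIFX":
        return "big"
    raise ValueError("unknown container ID: %r" % header[:4])
```

This only inspects the container header; it does not verify that the sample data was actually written in that byte order.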


#15

Hey all,
as @yv001 mentioned, little endian is needed for now. WAVE recording only allows little endian, which is why we were using it in the first place. Playing around a bit with the settings @lissyx suggested, we still managed to get a further decrease in error rate. The solution was to use the .au file type for recording.

Recording:

arecord -f S16_BE -c 1 -r 16000 -t au -d 3 file_name.au

Converting:

sox input.au output.wav

And the great news is:
We’re now at about 40% error rate, so steadily making progress. However, the test setup wasn’t very big this time, so please don’t be too hard on me with these numbers. And if anybody comes across anything else we could try, we’ll gladly give it a go.


(Vincent Foucault) #16

Very interesting. Thanks for sharing.


(BadrEL) #17

Hi,
Isn’t it possible to get the same result using sox only, without using arecord?