Performance for native English speaker with pre-trained model?

Hi, I am just trying to get an initial feel for DeepSpeech and how it might work with my voice. So I did the following:
installed deepspeech with pip on my MacBook;
recorded an audio of me talking as clearly and slowly as possible and converted it like this:
sox data/bbt_talking/clear_test.wav -r 16000 -b 16 -c 1 data/bbt_talking/clear_test_b16_c1_r16k.wav
ran deepspeech with the pre-trained English model like this:
deepspeech models/output_graph.pb data/bbt_talking/clear_test_b16_c1_r16k.wav models/alphabet.txt models/lm.binary models/trie
The text output is pretty bad, not even close to the 5.6% error rate mentioned. I’d say more like 50% error on a simple phrase.
Are there other things to consider?

What was the original format of clear_test.wav? (Sample rate, bit depth, number of channels…) Also, could you provide a link to clear_test.wav and clear_test_b16_c1_r16k.wav?

Thanks for answering, @kdavis. Here are the links and info:

Converted file clear_test_b16_c1_r16k.wav:
https://drive.google.com/open?id=1w7Xb7kpVJ6tauWrMlNGILUp0-lTLzHMI

original clear_test.wav:
https://drive.google.com/open?id=1AP2-vySoOREeV6z2LM2GIStRNXs2Lqw3

play info for converted:

data/bbt_talking/clear_test_b16_c1_r16k.wav:

 File Size: 350k      Bit Rate: 256k
  Encoding: Signed PCM    
  Channels: 1 @ 16-bit   
Samplerate: 16000Hz      
Replaygain: off         
  Duration: 00:00:10.94

play info for original:

data/bbt_talking/clear_test.wav:

 File Size: 3.86M     Bit Rate: 2.82M
  Encoding: Signed PCM    
  Channels: 2 @ 32-bit   
Samplerate: 44100Hz      
Replaygain: off         
  Duration: 00:00:10.94 
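
For anyone who wants to check these header fields from a script rather than with sox, here is a minimal sketch using Python's standard `wave` module. The helper names `describe_wav` and `matches_model_input` are made up for illustration, and the sketch assumes plain PCM WAV files; other WAV encodings may not open with `wave`.

```python
import wave


def describe_wav(path):
    """Return (channels, bits_per_sample, sample_rate_hz, duration_seconds)."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        bits = wav.getsampwidth() * 8      # sample width is reported in bytes
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, bits, rate, duration


def matches_model_input(path):
    """True if the file is mono, 16-bit, 16 kHz (the format targeted in this thread)."""
    channels, bits, rate, _ = describe_wav(path)
    return (channels, bits, rate) == (1, 16, 16000)
```

Calling `matches_model_input("data/bbt_talking/clear_test_b16_c1_r16k.wav")` should confirm the converted file has the intended layout, though it says nothing about artifacts introduced by resampling.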

What is the desired text output and what did the network produce?

expected output:
"Hi mom, this is Brett, I am speaking as clearly as possible and as slowly as possible. I hope you get this."

network output:

$ deepspeech models/output_graph.pb data/bbt_talking/clear_test_b16_c1_r16k.wav models/alphabet.txt models/lm.binary models/trie 
Loading model from file models/output_graph.pb
2018-02-13 16:25:19.554660: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 1.945s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 5.815s.
Running inference.
i am on this as breath im speaking is clearly it is possible and is slowly as possible i hope you get his
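
To put a number on "how bad" instead of eyeballing it, one option is a word error rate (WER): the word-level edit distance between the expected text and the output, divided by the number of expected words. The sketch below is a plain-Python illustration, not the scoring code used for the published figure; the `normalize` and `word_error_rate` names and the normalization rule (lowercase, strip punctuation) are assumptions.

```python
import re


def normalize(text):
    """Lowercase, drop everything except letters, apostrophes and spaces, split into words."""
    return re.sub(r"[^a-z' ]", "", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


expected = ("Hi mom, this is Brett, I am speaking as clearly as possible "
            "and as slowly as possible. I hope you get this.")
output = ("i am on this as breath im speaking is clearly it is possible "
          "and is slowly as possible i hope you get his")
print(f"WER: {word_error_rate(expected, output):.1%}")
```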

I tried it locally, and I’m getting the same results as you, which is a good thing. So at least we are starting from the same point. I’ll try to take a look at the problem, but I likely won’t get time until tomorrow.

Excellent that the results are the same; I agree that helps a lot.

Thanks very much for any feedback you can give.
-Brett

One quick question:

Is it possible for you to record directly to signed PCM, single channel, 16-bit, 16 kHz audio instead of first recording in a different format and then converting?

I’m just curious if there are artifacts of the conversion that we don’t hear, but the machine “hears”.

Hi, sorry for the delay, it had to wait until the weekend. Recording directly to 16-bit, single channel, 16 kHz seems to have performed far better.

$ soxi recording.wav

Input File     : '/recording.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:04.00 = 64000 samples ~ 300 CDDA sectors
File Size      : 128k
Bit Rate       : 256k

I recorded that same sample of text and only note three small errors now:

“my mom this is bread i am speaking as clearly as possible and of slowly as possible i hope you get this”

And one of those is in the proper noun, which is always hard, even for humans.

Great! So it seems like artifacts of the conversion make a big difference. Thanks!

Glad to be of help, and thanks for helping with the debugging!

What can we do to reduce these conversion artifacts so that recognition still performs reasonably? Many uses will require conversion of audio sources, so this would be nice to figure out.

That’s a good question, but nobody on the team has time for that right now, so any experience and feedback on that is welcome :). Maybe some denoising, or low-pass or high-pass filtering might help?

I’m getting substantially better performance by applying a 200-3000 Hz bandpass filter in the conversion. Not sure what you folks prefer to use, but the following one-liners seem to do the job with ffmpeg or sox:
ffmpeg -i input.wav -acodec pcm_s16le -ac 1 -ar 16000 -af lowpass=3000,highpass=200 output.wav
sox input.wav -b 16 output.wav channels 1 rate 16k sinc 200-3k

For reference, the source files I’m using are snippets of noisy air traffic control conversations from liveatc.net
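
If the conversion needs to run over many files, the ffmpeg one-liner above can be wrapped in a few lines of Python. This is only a sketch: it assumes `ffmpeg` is on the PATH, and the function names `ffmpeg_bandpass_args` and `convert` are illustrative. The filter values are the ones from the one-liner, not tuned settings.

```python
import subprocess


def ffmpeg_bandpass_args(src, dst, low_hz=200, high_hz=3000, rate=16000):
    """Build the ffmpeg argument list: mono, 16-bit PCM, resampled, band-limited."""
    return [
        "ffmpeg", "-i", src,
        "-acodec", "pcm_s16le",   # signed 16-bit little-endian PCM
        "-ac", "1",               # one channel
        "-ar", str(rate),         # output sample rate in Hz
        "-af", f"lowpass={high_hz},highpass={low_hz}",
        dst,
    ]


def convert(src, dst):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(ffmpeg_bandpass_args(src, dst), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.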

Nice! Thanks for experimenting and reporting your results back.

Thanks @alex.n.james, I was having the same problem and the bandpass filter worked like a charm! Should this be in the README or documentation (or maybe it is already)? The difference is just astonishing.

Any PR that improves the docs is welcome, so feel free.