Performance for native English speaker with pre-trained model?

btofel · February 13, 2018, 9:48pm

Hi, I am just trying to get an initial feel for DeepSpeech and how it might work with my voice. So I did the following:
installed deepspeech with pip on my MacBook;
recorded an audio clip of me talking as clearly and slowly as possible and converted it like this:
sox data/bbt_talking/clear_test.wav -r 16000 -b 16 -c 1 data/bbt_talking/clear_test_b16_c1_r16k.wav
ran deepspeech with the pre-trained English model like this:
deepspeech models/output_graph.pb data/bbt_talking/clear_test_b16_c1_r16k.wav models/alphabet.txt models/lm.binary models/trie
The text output is pretty bad, not even close to the 5.6% error mentioned. I’d say more like 50% error on a simple phrase.
Are there other things to consider?

kdavis · February 14, 2018, 5:57am

What was the original format of clear_test.wav? (Sample rate, bit depth, number of channels…) Also, could you provide a link to clear_test.wav and clear_test_b16_c1_r16k.wav?

btofel · February 14, 2018, 2:36pm

Thanks for answering, @kdavis. Here are the links and info:

Converted file clear_test_b16_c1_r16k.wav:
https://drive.google.com/open?id=1w7Xb7kpVJ6tauWrMlNGILUp0-lTLzHMI

original clear_test.wav:
https://drive.google.com/open?id=1AP2-vySoOREeV6z2LM2GIStRNXs2Lqw3

play info for converted:

data/bbt_talking/clear_test_b16_c1_r16k.wav:

 File Size: 350k      Bit Rate: 256k
  Encoding: Signed PCM    
  Channels: 1 @ 16-bit   
Samplerate: 16000Hz      
Replaygain: off         
  Duration: 00:00:10.94

play info for original:

data/bbt_talking/clear_test.wav:

 File Size: 3.86M     Bit Rate: 2.82M
  Encoding: Signed PCM    
  Channels: 2 @ 32-bit   
Samplerate: 44100Hz      
Replaygain: off         
  Duration: 00:00:10.94
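
The same format check can be scripted. The following is a minimal sketch using only Python's standard `wave` module; the names `EXPECTED`, `wav_format`, and `format_mismatches` are made up for illustration, and the target format is assumed to be the mono, 16-bit, 16 kHz PCM used in this thread. The `wave` module reads plain PCM WAV files only, so other encodings may raise an error.

```python
import wave

# Target format assumed from this thread: mono, 16-bit signed PCM, 16 kHz.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def wav_format(path):
    """Return the basic format fields of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }


def format_mismatches(path):
    """List the fields that differ from EXPECTED as (name, found, wanted)."""
    found = wav_format(path)
    return [
        (name, found[name], wanted)
        for name, wanted in EXPECTED.items()
        if found[name] != wanted
    ]
```

An empty list from `format_mismatches("clear_test_b16_c1_r16k.wav")` would mean the header matches the target format. It says nothing about artifacts introduced by the conversion itself.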

kdavis · February 14, 2018, 2:54pm

What is the desired text output and what did the network produce?

btofel · February 14, 2018, 4:19pm

expected output:
"Hi mom, this is Brett, I am speaking as clearly as possible and as slowly as possible. I hope you get this."

network output:

$ deepspeech models/output_graph.pb data/bbt_talking/clear_test_b16_c1_r16k.wav models/alphabet.txt models/lm.binary models/trie 
Loading model from file models/output_graph.pb
2018-02-13 16:25:19.554660: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 1.945s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 5.815s.
Running inference.
i am on this as breath im speaking is clearly it is possible and is slowly as possible i hope you get his
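
To put a number on the mismatch rather than estimating it by eye, a word error rate can be computed from the two transcripts. This is a minimal sketch, assuming the usual definition (word-level edit distance divided by the number of reference words) and a simple normalization that lowercases and strips punctuation; the function names are illustrative and not part of DeepSpeech.

```python
import re


def words(text):
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z' ]+", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word), # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Calling `word_error_rate(expected_text, network_output)` with the two strings above gives a fraction that can be compared directly against a published error rate, bearing in mind that a single short phrase is a very small sample.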

kdavis · February 14, 2018, 4:43pm

I tried it locally, and I’m getting the same results as you, which is a good thing. So at least we are starting from the same point. I’ll try to take a look at the problem, but I’ll likely not get time until tomorrow.

btofel · February 14, 2018, 4:50pm

Excellent that results are the same, I agree that helps a lot.

Thanks very much for any feedback you can give.
-Brett

kdavis · February 15, 2018, 6:42am

One quick question:

Is it possible for you to record directly to signed PCM, single-channel, 16-bit, 16 kHz audio instead of first recording in a different format and then converting?

I’m just curious if there are artifacts of the conversion that we don’t hear, but the machine “hears”.

btofel · February 17, 2018, 8:42pm

Hi, sorry for the delay, it had to wait until the weekend. Recording directly to 16-bit, single-channel, 16 kHz audio seems to have performed far better.

$ soxi recording.wav

Input File     : '/recording.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:04.00 = 64000 samples ~ 300 CDDA sectors
File Size      : 128k
Bit Rate       : 256k

I recorded that same sample of text and only note three small errors now:

“my mom this is bread i am speaking as clearly as possible and of slowly as possible i hope you get this”

And one of those is in the proper noun, always hard, even for humans.

Great! So it seems like artifacts of conversion make a big difference. Thanks!

kdavis · February 17, 2018, 8:50pm

Glad to be of help, and thanks for helping with the debugging!

alex.n.james · February 18, 2018, 6:00am

What can we do to reduce these conversion artifacts so that recognition still performs reasonably? Many uses will require conversion of audio sources so this would be nice to figure out.

lissyx · February 18, 2018, 10:52am

That’s a good question, but nobody on the team has time for that right now, so any experience and feedback on that is welcome :). Maybe some denoising, or low-pass or high-pass filtering might help?

alex.n.james · February 18, 2018, 9:31pm

I’m getting substantially better performance by applying a 200-3000 Hz bandpass filter in the conversion. Not sure what you folks prefer to use, but the following one-liners seem to do the job with ffmpeg or sox:
• ffmpeg -i input.wav -acodec pcm_s16le -ac 1 -ar 16000 -af lowpass=3000,highpass=200 output.wav
• sox input.wav -b 16 output.wav channels 1 rate 16k sinc 200-3k

For reference, the source files I’m using are snippets of noisy air traffic control conversations from liveatc.net
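
For converting many files, the sox one-liner above can be wrapped in a small script. This is a minimal sketch, not a tested pipeline: the helper names `sox_bandpass_command` and `convert` are hypothetical, sox is assumed to be installed and on PATH, and the default cutoffs are simply the 200 Hz and 3000 Hz values from this post. The upper cutoff is required to stay below 8000 Hz, the Nyquist limit for 16 kHz output.

```python
import subprocess


def sox_bandpass_command(src, dst, low_hz=200, high_hz=3000):
    """Build the sox argument list for mono, 16-bit, 16 kHz band-passed output."""
    if not 0 < low_hz < high_hz < 8000:
        raise ValueError("need 0 < low_hz < high_hz < 8000")
    return [
        "sox", src,
        "-b", "16", dst,                 # 16-bit output file
        "channels", "1",                 # mix down to mono
        "rate", "16k",                   # resample to 16 kHz
        "sinc", f"{low_hz}-{high_hz}",   # band-pass between the two cutoffs
    ]


def convert(src, dst, **cutoffs):
    """Run the conversion; raises CalledProcessError if sox fails."""
    subprocess.run(sox_bandpass_command(src, dst, **cutoffs), check=True)
```

Keeping the command construction separate from the `subprocess` call makes it easy to print or log the exact command, or to try other cutoffs, since the best band likely depends on the source material.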

kdavis · February 19, 2018, 5:13am

Nice! Thanks for experimenting and reporting your results back.

darfunkel · July 12, 2019, 4:03pm

Thanks @alex.n.james, I was having the same problem and the bandpass filter worked like a charm! Should this be in the README or documentation (or maybe it is already)? The difference is just astonishing.

lissyx · July 12, 2019, 6:02pm

Any PR that improves the docs is welcome, so feel free.