Effect of various improvements on accuracy

Hello, I noticed a few stale branches where various things have been tried, such as batch normalization, dilated convolutions, or filter banks instead of MFCCs. I'm curious: what was the impact of those features? If there were improvements, why haven't those features made it into the main branch?
Also, why are you using the DeepSpeech1 architecture, and not DeepSpeech2?

Finally, was the 6.5% accuracy on LibriSpeech achieved with the architecture as currently specified in the master branch (that is, the DS1 model using LSTM cells)?
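(Editor's note on the filter-bank vs. MFCC question above: MFCCs are just the DCT of the log mel filter-bank energies, so "filter banks instead of MFCC" means feeding the network the log energies directly and skipping the DCT. A minimal NumPy sketch of that relationship, with illustrative parameter choices (26 filters, 13 cepstral coefficients, 16 kHz audio) that are common defaults, not necessarily what this project uses:)

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    # Triangular filters evenly spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, center):
            fbank[i - 1, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):
            fbank[i - 1, j] = (right - j) / max(right - center, 1)
    return fbank

def features(frame, sr=16000, n_fft=512, n_filters=26, n_ceps=13):
    # Power spectrum of one analysis frame.
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
    # Log mel filter-bank energies: what the "filter banks" branch would feed the net.
    fb_energies = np.log(mel_filterbank(n_filters, n_fft, sr) @ power + 1e-10)
    # MFCC = DCT-II of the log energies, keeping the first n_ceps coefficients.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    mfcc = dct @ fb_energies
    return fb_energies, mfcc

# 25 ms of a 440 Hz tone at 16 kHz as a stand-in for a speech frame.
frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)
fb, mfcc = features(frame)
print(fb.shape, mfcc.shape)  # (26,) (13,)
```

The practical argument for raw filter banks is that the DCT decorrelation step mostly helped GMM-based systems; neural networks can learn from the correlated log energies directly.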

As you can deduce, if we have not merged those, it's because the results were either not good or not good enough :slight_smile:

I’ll let @kdavis reply on this one

Yes, that's right, this level of accuracy was achieved using LSTM cells on the DeepSpeech v1 model.

That's interesting. The DS2 paper mentions that the WER on LibriSpeech test-clean improved from 7.89% to 5.33%. I assume that is mostly due to using batch norm, three 2D convolutional layers, and seven GRU layers.

Your result of 6.5% is especially impressive because (I assume) you didn’t have access to the 12k hours Baidu used for training. What was the total number of hours in your training dataset (used to reach 6.5% accuracy)?
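(Editor's note: the percentages being compared in this thread are word error rates, i.e. the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal self-contained sketch of the metric:)

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words
    # (substitutions, insertions, and deletions all cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

# One substitution ("sat" -> "sit") plus one deletion ("the"): 2 errors / 6 words.
print(wer("the cat sat on the mat", "the cat sit on mat"))  # ≈ 0.333
```

So "6.5% accuracy" above means roughly one word in fifteen is substituted, inserted, or deleted relative to the reference transcripts.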

I think it was somewhere close; it's a mix of different datasets, and I don't remember the amount of each. Maybe @kdavis remembers?