Effect of various improvements on accuracy

Hello, I noticed a few stale branches where various things have been tried, such as batch normalization, dilated convolutions, or filter banks instead of MFCCs. I'm curious: what was the impact of those features? If there were improvements, why haven't those features made it into the main branch?
Also, why are you using the DeepSpeech1 architecture, and not DeepSpeech2?

Finally, was the 6.5% accuracy on LibriSpeech achieved with the architecture as currently specified in the master branch (that is, the DS1 model using LSTM cells)?
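(Editor's note on the filter-bank vs. MFCC question above: MFCCs are just the DCT of the log mel filter-bank energies, so "filter banks instead of MFCC" means feeding the network the log energies directly and skipping the DCT. A minimal NumPy sketch of that relationship, with illustrative parameter choices (26 filters, 13 cepstral coefficients, 16 kHz audio) that are common defaults, not necessarily what this project uses:)

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    # Triangular filters evenly spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, center):
            fbank[i - 1, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):
            fbank[i - 1, j] = (right - j) / max(right - center, 1)
    return fbank

def features(frame, sr=16000, n_fft=512, n_filters=26, n_ceps=13):
    # Power spectrum of one analysis frame.
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
    # Log mel filter-bank energies: what the "filter banks" branch would feed the net.
    fb_energies = np.log(mel_filterbank(n_filters, n_fft, sr) @ power + 1e-10)
    # MFCC = DCT-II of the log energies, keeping the first n_ceps coefficients.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    mfcc = dct @ fb_energies
    return fb_energies, mfcc

# 25 ms of a 440 Hz tone at 16 kHz as a stand-in for a speech frame.
frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)
fb, mfcc = features(frame)
print(fb.shape, mfcc.shape)  # (26,) (13,)
```

The practical argument for raw filter banks is that the DCT decorrelation step mostly helped GMM-based systems; neural networks can learn from the correlated log energies directly.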

As you can deduce, if we have not merged those, it's because the results were either not good or not good enough :slight_smile:

I’ll let @kdavis reply on this one

Yes, that's right, this level of accuracy was achieved using LSTM cells on the DeepSpeech v1 model.

That's interesting. The DS2 paper mentions that the WER on LibriSpeech test-clean improved from 7.89% to 5.33%. I assume that is mostly due to using batch norm, three 2D convolutional layers, and seven GRU layers.

Your result of 6.5% is especially impressive because (I assume) you didn’t have access to the 12k hours Baidu used for training. What was the total number of hours in your training dataset (used to reach 6.5% accuracy)?
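(Editor's note: the percentages being compared in this thread are word error rates, i.e. the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal self-contained sketch of the metric:)

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words
    # (substitutions, insertions, and deletions all cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

# One substitution ("sat" -> "sit") plus one deletion ("the"): 2 errors / 6 words.
print(wer("the cat sat on the mat", "the cat sit on mat"))  # ≈ 0.333
```

So "6.5% accuracy" above means roughly one word in fifteen is substituted, inserted, or deleted relative to the reference transcripts.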

I think it was somewhere close; it's a mix of different datasets, and I don't remember the amount of each. Maybe @kdavis remembers?