DeepSpeech accuracy data for LibriSpeech

(Zhao, Xiaohui) #1

Excellent work on the WER in this release (5.66% on librispeech test-clean).

We are trying to reproduce the release accuracy, but we found that the training data is Fisher+Librispeech+SwitchBoard, and the Fisher and SwitchBoard datasets are not free of charge.

So my question is

Do you have the accuracy data on librispeech? (trained on the 960 hours, tested on test-clean/test-other)

Could you provide the WER accuracy with and without the language model?


BTW, I found the 12% WER figure in this issue, but I suppose there may be an updated number for it.

Best Regards.


(kdavis|PTO) #2


I just ran a benchmark training the current master on all of librispeech train (clean + other) and running a test on all of librispeech test (clean + other) and got a WER of 20.6%[1], not an amazing result.

This result was obtained without tuning any hyperparameters to librispeech, using the same hyperparameters as in the release run that trained on Fisher+Librispeech+SwitchBoard.
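For readers comparing the WER figures quoted in this thread, here is a minimal sketch of how a corpus-level word error rate is commonly computed: total word-level edit distance divided by the total number of reference words. The function names `word_edit_distance` and `corpus_wer` are hypothetical, and this is not taken from the DeepSpeech evaluation code, which may normalize text or aggregate differently.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two lists of words."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Total word edits divided by total reference words (assumed definition)."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        total_edits += word_edit_distance(ref_words, hyp.split())
        total_words += len(ref_words)
    return total_edits / total_words
```

For example, with references `["the cat sat on the mat", "hello world"]` and hypotheses `["the cat sat on mat", "hello word"]`, there is one deletion and one substitution over 8 reference words, so `corpus_wer` returns 0.25 (25% WER).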

(Zhao, Xiaohui) #3

Thanks for your response.

I understand the 20.6% WER is acceptable considering the absence of a language model, and pytorch got a very similar WER (21%) on DeepSpeech2.

Could you post the best WER on librispeech after hyperparameter tuning?

FYI, Baidu published the
, and we could get better accuracy with RNN-transducer or Attention without any language model.

Best Regards.


(kdavis|PTO) #4

Unfortunately training takes time and our servers are booked running benchmarks for our new release. So we don’t really have time to train a model on Librispeech and also tune the associated hyperparameters.

On DeepSpeech3, one of the main reasons RNN-transducer or Attention bested CTC was that they trained on lots of data. RNN-transducer and Attention were able to create an implicit language model. Librispeech is not a large data set. So I doubt the results could be matched using only Librispeech.

(Zhao, Xiaohui) #5

I agree with you about the librispeech dataset for DeepSpeech3. Since you have the Fisher+SwitchBoard datasets, you have a chance to reproduce the paper's accuracy on DeepSpeech3.

BTW, do you know the exact LDC catalog numbers for the Fisher and SwitchBoard datasets?

I got the following result, but it does not match the paper.


(kdavis|PTO) #6

Unfortunately, we haven’t had a chance to reproduce the DeepSpeech3 results. However, we have two open issues to do so.

But we don’t have the human/compute resources to tackle them now.

As for the data sets, the Fisher data set is from LDC2004T19, LDC2004S13, LDC2005T19, and LDC2005S13. The SwitchBoard data set is from LDC97S62.