Questions about speech corpora for pre-trained model


I had some questions about the pre-trained model for 0.4.1.

  1. How many hours of data in total were used to train the pre-trained model?

  2. What are the proportions of each speech corpus used? i.e. is it mainly LibriSpeech, Common Voice or an even mix of all of them?

  3. It says that the model is optimized for American English but that it uses the English Common Voice corpus, so presumably this isn’t filtered first and thus contains all English accents?

(kdavis) #2

This info is also in our release notes here. Particularly…

  1. 2000 (Fisher) + 260 (Switchboard) + 1000 (LibriSpeech) + approx. 600 (Common Voice) = 3860 hours
  2. It uses all of the above corpora
  3. Yes, it contains all accents, but so does American English. In particular, the training data is dominated by Fisher, which is itself dominated by American English; see the links above.
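
For question 2, the hours listed above can be turned into rough proportions. This is a minimal sketch in Python, assuming the per-corpus hours quoted above are accurate (the Common Voice figure is approximate) and that every hour counts equally toward training, which is not confirmed here. The names `corpus_hours` and `corpus_shares` are illustrative only and not part of any release.

```python
# Hours per corpus as quoted in the reply above.
# The Common Voice figure is approximate.
corpus_hours = {
    "Fisher": 2000,
    "Switchboard": 260,
    "LibriSpeech": 1000,
    "Common Voice": 600,
}


def corpus_shares(hours):
    """Return each corpus's fraction of the total hours (hypothetical helper)."""
    total = sum(hours.values())
    return {name: h / total for name, h in hours.items()}


total_hours = sum(corpus_hours.values())
shares = corpus_shares(corpus_hours)

print(f"Total: {total_hours} hours")
for name, share in shares.items():
    # Format each fraction as a percentage with one decimal place.
    print(f"{name}: {share:.1%}")
```

By this rough count, Fisher makes up a little over half of the data, LibriSpeech about a quarter, Common Voice roughly 16% and Switchboard about 7%.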


Thanks! I asked because I had read the release notes beforehand and that information isn’t listed there. It would be great if it were included in future.