Do the Common Voice datasets contain multiple audio samples for the same text in the same language?

Our current tests suggest that English will need around 2,000 validated hours to train a basic general model. Different languages will most likely need different amounts; we will only know for sure as we collect data and train models.

For major languages we have been collecting 1-3 million sentences to have a buffer and avoid repetitions. In the tests the Deep Speech team has run for English, French, German and Mandarin, model quality was higher when each sentence was recorded only once.
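As a rough illustration of why that sentence count lines up with the hours target, here is a back-of-the-envelope sketch. The average clip length used below is an assumption for the example, not a figure from Common Voice or Deep Speech.

```python
# Back-of-the-envelope sketch: how many unique sentences are needed to reach
# a target number of validated hours if no sentence is ever recorded twice.
# AVG_CLIP_SECONDS is an assumed value for illustration only.

TARGET_HOURS = 2000        # validated hours suggested for a basic English model
AVG_CLIP_SECONDS = 5.0     # assumed average recording length per sentence

target_seconds = TARGET_HOURS * 3600
unique_sentences_needed = target_seconds / AVG_CLIP_SECONDS

print(f"~{unique_sentences_needed:,.0f} unique sentences")  # ~1,440,000
```

Under that assumption, roughly 1.4 million unique sentences would be needed, which is why collecting 1-3 million sentences gives a comfortable buffer.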

There has been some math on this topic: