Do the Common Voice datasets contain multiple audio samples for the same text in the same language?

Knowing this would spare me quite a few GBs of downloading. Thanks!

Yes, there are some languages that have more than one recording per sentence.

Thanks! Is that the case for the English dataset? And is the collection designed to gather multiple recordings of every sentence, or is it just arbitrary?

English has some repetitions, but we have been working with each language community to make sure there are enough sentences, so that we keep repeated recordings of the same sentence to a minimum, ideally just one recording per sentence.

Thanks again. I’m looking for a dataset that has English speech audio samples of the same sentence spoken by different speakers. Is it possible to obtain these via Common Voice?

You can download the English dataset, open the index file, sort it by sentence, and pick the sentences that have more than one recording.

Please, let us know if this is useful, we would like to know how people are using our dataset :slight_smile:
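The filtering described above could be sketched in Python roughly as follows. This is only an illustration, not an official tool: it assumes the index is a tab-separated file (for example `validated.tsv`) with `client_id`, `path` and `sentence` columns, and that `client_id` identifies a speaker. The function name `sentences_with_multiple_speakers` and the tiny in-memory sample are made up for the example.

```python
import csv
import io
from collections import defaultdict


def sentences_with_multiple_speakers(tsv_file, min_speakers=2):
    """Return {sentence: [(client_id, path), ...]} for sentences that were
    recorded by at least `min_speakers` distinct speakers.

    Assumes a tab-separated index with `client_id`, `path` and `sentence`
    columns (an assumption about the file layout).
    """
    clips_by_sentence = defaultdict(list)
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        clips_by_sentence[row["sentence"]].append((row["client_id"], row["path"]))

    return {
        sentence: clips
        for sentence, clips in clips_by_sentence.items()
        if len({client_id for client_id, _ in clips}) >= min_speakers
    }


# Tiny made-up sample standing in for the real index file.
sample = (
    "client_id\tpath\tsentence\n"
    "a\tclip1.mp3\tHello world.\n"
    "b\tclip2.mp3\tHello world.\n"
    "a\tclip3.mp3\tGood morning.\n"
    "a\tclip4.mp3\tGood morning.\n"
    "c\tclip5.mp3\tOnly once.\n"
)

result = sentences_with_multiple_speakers(io.StringIO(sample))
for sentence, clips in result.items():
    print(sentence, [path for _, path in clips])
```

With a real download you would pass an opened file instead of the sample, e.g. `open("validated.tsv", encoding="utf-8", newline="")`. Note that "Good morning." is dropped in the sample because both of its clips come from the same speaker.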

If ideally there should be one recording per sentence and we need 10,000 recorded hours, how many sentences should we prepare in each language? If we suppose each recording lasts 10 seconds, would we ideally need 3.6 million sentences in each language?

If such a number isn’t achievable for a small language (for example Basque), how many recordings and sentences could give us a usable speech-to-text system? I’m not asking for an exact number, of course, just a very rough one, to clarify what our objectives should be.

At this moment, there are 6,212 Basque sentences, so if I’m right, ideally we shouldn’t record more than about 17 hours for now. If more recordings are made, the 638 speakers could be wasting their time, because that work is not really going to help create a better speech-to-text system. Is this right?

We have already recorded 99 hours, many people have recorded the same sentences, and the volunteers keep recording and validating the same material time after time. Should we recommend that Basque contributors stop making new recordings until more sentences are available?
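For reference, the arithmetic in this post can be checked with a few lines of Python. It is a back-of-the-envelope sketch that relies on the 10-seconds-per-recording assumption stated above; the function names are made up for the example, and the last figure (average recordings per sentence) is derived under that same assumption rather than measured.

```python
SECONDS_PER_HOUR = 3600


def sentences_needed(target_hours, clip_seconds=10.0):
    """Sentences required to reach `target_hours` with one recording each."""
    return round(target_hours * SECONDS_PER_HOUR / clip_seconds)


def hours_covered(num_sentences, clip_seconds=10.0):
    """Hours of audio obtained if every sentence is recorded exactly once."""
    return num_sentences * clip_seconds / SECONDS_PER_HOUR


def average_recordings_per_sentence(recorded_hours, num_sentences, clip_seconds=10.0):
    """Average number of recordings per sentence implied by the recorded hours."""
    return recorded_hours * SECONDS_PER_HOUR / clip_seconds / num_sentences


print(sentences_needed(10_000))                                # 3600000
print(round(hours_covered(6_212), 1))                          # 17.3
print(round(average_recordings_per_sentence(99, 6_212), 1))    # 5.7
```

So, if clips really average 10 seconds, 99 recorded hours over 6,212 sentences would mean each sentence has been recorded between five and six times on average.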

Our current tests suggest that English will need around 2000 validated hours to train a basic general model. Different languages will most likely need different amounts; we will only know as we collect data and train models.

For major languages we have been collecting 1-3M sentences to have a buffer and avoid repetitions. In the tests that Deep Speech has been running for English, French, German and Mandarin, the quality of the model was higher when each sentence had only one recording.

There was some math on this topic:


It turned out that there are many audio samples with the same sentence in the English dataset. I believe there are 16k sentences that have at least 5 recordings. However, I have not yet checked whether all of them are unique.

Here is an example of a sentence recorded multiple times.