We have launched Romansh Sursilvan with the 5,000 given sentences. Now: is every sentence read only once, and by one person? Or, put another way: if 100 people each read 50 sentences, are there no more sentences left to read? Or can the same 100 people each read all 5,000 sentences (which is what I hope is the case)?
Thanks for a short reply.
Right now there is no limitation on how many times a sentence can be read by a person.
BUT, DeepSpeech only uses one recording per sentence, because using more reduces the quality of the models, and that’s why we are evaluating restricting this in the near future (we’ll post soon with more details about this).
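To make the "one recording per sentence" idea concrete, here is a minimal sketch of filtering a clip list down to a single recording per distinct sentence. The TSV layout (`path` and `sentence` columns) is an assumption modeled loosely on Common Voice release files; this is not the official DeepSpeech importer.

```python
import csv
import io

def one_clip_per_sentence(tsv_text):
    """Keep only the first recording seen for each distinct sentence.

    Assumes a TSV with 'path' and 'sentence' columns; the column
    names are an illustrative assumption, not an official format.
    """
    seen = set()
    kept = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row["sentence"] not in seen:
            seen.add(row["sentence"])
            kept.append(row)
    return kept

# Toy example: three clips, two of which read the same sentence.
sample = (
    "path\tsentence\n"
    "a.mp3\tHe lives in London.\n"
    "b.mp3\tHe lives in London.\n"
    "c.mp3\tShe works in Paris.\n"
)
print([r["path"] for r in one_clip_per_sentence(sample)])  # ['a.mp3', 'c.mp3']
```

In this sketch, duplicates are simply dropped; a real pipeline would more likely pick the best-validated clip per sentence rather than the first one seen.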
We might be able to waive this restriction for languages where, due to their size, it’s not realistic to collect enough data to train these models in the near future.
Many thanks for your answer.
I was afraid we wouldn’t have enough sentences for the 72-hour project we are launching from 16 to 19 January 2020. Now I can be a bit more relaxed about it.
This could be essential for us to succeed in the future. Thanks again.
Is this the case for every deep-learning algorithm, or could repetitions be useful for other models? I know that some people use the dataset to test the error rate of their finished system; at least for that use case, repetitions do no harm.
That’s what I thought it could be useful for.
We are optimizing for DeepSpeech training; we’ll follow up with more details on this soon, since we haven’t taken a final decision yet.
I see, it’s totally understandable that you want to focus on one system in this project. I would like to understand overfitting a little better. Does it only happen with identical sentences, or are similar sentences a problem too? For example, I often see sentences like this in the collections:
He lives in London.
He lives in London?
He lives in london.
He lives in London in a small street.
At least the first three sentences could be seen as de-facto duplicates in the dataset. Should we aim to avoid this, or is this okay?
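For illustration, a quick way to check whether sentences collapse into the same form after a simple normalization (lowercasing and stripping punctuation). This normalization is only my sketch; whether the Common Voice pipeline applies anything like it is exactly what I’m asking.

```python
import string

def normalize(sentence):
    """Collapse case and punctuation so near-identical prompts compare equal.

    Purely illustrative; not the actual Common Voice deduplication logic.
    """
    stripped = sentence.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.lower().split())

sentences = [
    "He lives in London.",
    "He lives in London?",
    "He lives in london.",
    "He lives in London in a small street.",
]
groups = {}
for s in sentences:
    groups.setdefault(normalize(s), []).append(s)

# The first three sentences share one normalized form; the fourth stands alone.
print(len(groups))  # 2
```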
@reuben might have a better answer here, since this is about model training and DeepSpeech.
It’s not about a similarity threshold; real language has similar sentences, so that’s not necessarily a problem. We had to cut duplicates from Common Voice English because it had really high rates of duplication, like the same sentence being repeated thousands of times. This can throw off the learning algorithm and hurt the quality of the models. The pragmatic answer is that it is much, much easier to collect text than it is to collect voice recordings, so it’s better to put a little more effort in upfront and collect a lot of unique sentences. Otherwise you risk doing an expensive collection and ending up with a dataset that is not as useful as it could be.
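A simple way to spot the kind of extreme duplication described above is to count sentence frequencies before recording starts. A minimal sketch (the corpus and numbers here are made up for illustration):

```python
from collections import Counter

def duplication_report(sentences):
    """Return the number of distinct sentences and the most repeated ones."""
    counts = Counter(sentences)
    return len(counts), counts.most_common(3)

# Hypothetical corpus: one sentence repeated four times, two unique ones.
corpus = ["the cat sat"] * 4 + ["a dog ran", "birds fly"]
unique, top = duplication_report(corpus)
print(unique, top[0])  # 3 ('the cat sat', 4)
```

Running a check like this on a candidate text collection makes it easy to see whether a few sentences dominate the corpus before any expensive voice collection begins.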
Here’s a CC0 text dataset with 160 languages, for example: https://traces1.inria.fr/oscar/
Okay, I see, that makes sense to me. Thanks for the link to the OSCAR corpus; it even has a huge collection in Esperanto. I will take a closer look at this for German too.
You talked about English with thousands of repetitions, but what about smaller numbers, like the recent user spike in Polish that resulted in recording 30+ hours of material for ~10 hours’ worth of text? Is that in any way useful?
Having duplicates isn’t the most efficient way to collect data, but I would never see having extra data as a bad thing, as it gives you more options. That data could still be useful in some situations, such as fine-tuning for specific regional accents.