We have launched Romansh Sursilvan with the 5,000 given sentences. Now: is every sentence read only once, and by one person? Or, put another way: if 100 people each read 50 sentences, are there no more sentences left to read? Or can the same 100 people each read all 5,000 sentences (which is what I hope is the case)?
Thanks for a short reply.
Right now there is no limitation on how many times a sentence can be read by a person.
BUT, DeepSpeech only uses one recording per sentence, because using more reduces the quality of the models, and that’s why we are evaluating restricting this in the near future (we’ll post soon with more details about this).
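To make the "one recording per sentence" idea concrete, here is a minimal sketch of filtering a clip list down to a single recording per distinct sentence. The TSV layout (`path` and `sentence` columns) is an assumption modeled loosely on Common Voice release files; this is not the official DeepSpeech importer.

```python
import csv
import io

def one_clip_per_sentence(tsv_text):
    """Keep only the first recording seen for each distinct sentence.

    Assumes a TSV with 'path' and 'sentence' columns; the column
    names are an illustrative assumption, not an official format.
    """
    seen = set()
    kept = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row["sentence"] not in seen:
            seen.add(row["sentence"])
            kept.append(row)
    return kept

# Toy example: three clips, two of which read the same sentence.
sample = (
    "path\tsentence\n"
    "a.mp3\tHe lives in London.\n"
    "b.mp3\tHe lives in London.\n"
    "c.mp3\tShe works in Paris.\n"
)
print([r["path"] for r in one_clip_per_sentence(sample)])  # ['a.mp3', 'c.mp3']
```

In this sketch, duplicates are simply dropped; a real pipeline would more likely pick the best-validated clip per sentence rather than the first one seen.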
We might be able to waive this restriction for languages where, due to their size, it’s not realistic to collect enough data to train these models in the near future.
Many thanks for your answer.
I was afraid we wouldn’t have enough sentences for the 72-hour project we are launching from 16 to 19 January 2020. Now I can be a bit more relaxed about it.
This could be essential for us to succeed in the future. Thanks again.
Is this the case for every deep-learning algorithm, or could repetitions be useful for other models? I know that some people use the dataset to test the error rate of their finished system; at least for that use case, repetitions do no harm.
That’s what I thought it could be useful for.
We are optimizing for DeepSpeech training; we’ll follow up with more details on this soon, since we haven’t taken a final decision yet.
I see, it’s totally understandable that you want to focus on one system in this project. I would like to understand overfitting a little better. Does it only happen with identical sentences, or are similar sentences a problem too? For example, I often see sentences like this in the collections:
He lives in London.
He lives in London?
He lives in london.
He lives in London in a small street.
At least the first three sentences could be seen as de-facto duplicates in the dataset. Should we aim to avoid this, or is this okay?
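For illustration, a quick way to check whether sentences collapse into the same form after a simple normalization (lowercasing and stripping punctuation). This normalization is only my sketch; whether the Common Voice pipeline applies anything like it is exactly what I’m asking.

```python
import string

def normalize(sentence):
    """Collapse case and punctuation so near-identical prompts compare equal.

    Purely illustrative; not the actual Common Voice deduplication logic.
    """
    stripped = sentence.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.lower().split())

sentences = [
    "He lives in London.",
    "He lives in London?",
    "He lives in london.",
    "He lives in London in a small street.",
]
groups = {}
for s in sentences:
    groups.setdefault(normalize(s), []).append(s)

# The first three sentences share one normalized form; the fourth stands alone.
print(len(groups))  # 2
```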
@reuben might have a better answer here, since this is about model training and DeepSpeech.
It’s not about a similarity threshold; real language has similar sentences, so that’s not necessarily a problem. We had to cut duplicates from Common Voice English because it had really high rates of duplication, like the same sentence being repeated thousands of times. This can throw off the learning algorithm and hurt the quality of the models. The pragmatic answer is that it is much, much easier to collect text than it is to collect voice recordings, so it’s better to put a little more effort in upfront and collect a lot of unique sentences. Otherwise you risk doing an expensive collection and ending up with a dataset that is not as useful as it could be.
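A simple way to spot the kind of extreme duplication described above is to count sentence frequencies before recording starts. A minimal sketch (the corpus and numbers here are made up for illustration):

```python
from collections import Counter

def duplication_report(sentences):
    """Return the number of distinct sentences and the most repeated ones."""
    counts = Counter(sentences)
    return len(counts), counts.most_common(3)

# Hypothetical corpus: one sentence repeated four times, two unique ones.
corpus = ["the cat sat"] * 4 + ["a dog ran", "birds fly"]
unique, top = duplication_report(corpus)
print(unique, top[0])  # 3 ('the cat sat', 4)
```

Running a check like this on a candidate text collection makes it easy to see whether a few sentences dominate the corpus before any expensive voice collection begins.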
Here’s a CC0 text dataset with 160 languages, for example: https://traces1.inria.fr/oscar/
Okay, I see, that makes sense to me. Thanks for the link to the OSCAR corpus; it even has a huge collection in Esperanto. I will take a closer look at this for German too.
You talked about English with thousands of repetitions, but what about smaller numbers, like the recent user spike in Polish that resulted in recording 30+ hours of material for ~10 hours’ worth of text? Is that in any way useful?
Having duplicates isn’t the most efficient way to collect data, but I would never see having extra data as a bad thing, as it gives you more options. That data could still be useful in some situations, such as fine-tuning for specific regional accents.