Single Sentence Record Limit feature release

Christos · June 18, 2020, 3:14pm

In keeping with our commitment to improving the quality of our dataset, last week we launched a new feature to reduce data repetition.

The problem

Common Voice previously allowed people to record the same sentences when new sentence queues were exhausted, resulting in data repetition. Research has shown that the majority of people using the Common Voice dataset prefer one recording per sentence when training models. This provides lower word error rates compared to using redundant clips.

The solution

As of last week, the Common Voice platform is beginning to limit recordings to one validated clip per sentence across all languages. This is the first implementation step for this feature and will ensure that recording repetitions are phased out as voice contribution enters the second half of 2020.

To minimize disruption to the contributor experience, we have decided to gradually backfill the data needed to make this determination. Updates will be triggered every time there is a new recording or a new validation. This means that you may see a divergence in clips recorded compared to clips available to be validated each day. The act of recording or validating a sentence that already has a validated clip will remove all other clips for this sentence from the validation queue.
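As a rough illustration of the rule described above (not the actual implementation, which lives in the pull request mentioned further down), the queue update could be sketched like this. The `Clip` type and the `prune_validation_queue` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    # Hypothetical, simplified view of a clip; not the real database schema.
    clip_id: int
    sentence_id: str
    validated: bool = False


def prune_validation_queue(queue, all_clips, touched_sentence_id):
    """Return the validation queue after a new recording or validation
    event for `touched_sentence_id`.

    If that sentence already has a validated clip, every clip of the same
    sentence is dropped from the queue; otherwise the queue is unchanged.
    Clips are only removed from the queue, never deleted.
    """
    has_validated = any(
        c.validated for c in all_clips if c.sentence_id == touched_sentence_id
    )
    if not has_validated:
        return list(queue)
    return [c for c in queue if c.sentence_id != touched_sentence_id]
```

Because the update only runs when a sentence is touched, the queue shrinks gradually, which matches the divergence between recorded clips and clips available for validation described above.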

For some languages, this will be less noticeable than for others, depending on each language community’s contribution cadence and the proportion of new sentences available for recording. We are working on an interim migration that will address the languages with the fewest new sentences available. In the meantime, if you notice a significant discrepancy, that is an indicator that your language is running low on sentences to record, and we encourage you to refer to this post to see how you can help bring more sentences to the platform in your language.

The team is also aware that not all languages have a large enough contributor base to produce the 2,000 valid voice hours needed to train an STT model. We are planning on creating exemptions to this limit for languages with a smaller contributor population where it’s not realistic to achieve 2,000 valid voice hours, and at least 1k speakers, in the near future. This will allow languages to create datasets that mature in both size and quality as contribution grows. We’ll share more details on this exemption when it is ready.

We have worked closely with our Deep Speech colleagues and various machine learning experts to confirm that limiting sentence and clip repetition is the right approach for improving the quality of our data.

Common Voice follows the principles of Open by Design and all the technical details regarding the implementation of this solution can be found in this pull request on GitHub.

If you have any questions, comments, or suggestions, please reply to this post.

Thank you!

Christos and the Common Voice team

Okki · June 18, 2020, 8:23pm

And what happens to these clips? They’ll never be validated and don’t end up in the dataset?

It seemed to me that the dataset could be used for many projects, not just the creation of an STT model. Some may be interested in several clips of the same text. Or they might be interested only in female voice clips.

On the French version, one man recorded several tens of thousands of clips. If the system privileges his clips, just because he is first in the queue, rather than those of speakers who have recorded only a small number of clips, we lose in diversity.

It would therefore be preferable to offer a dataset with absolutely all valid clips, and to sort afterwards, according to the needs of each project.

cjbaker · June 18, 2020, 11:08pm

Thanks for opening this subject for discussion.

I agree that it’s important to do everything possible to maximize diversity of prompts and speakers, but as Okki said, there are indeed some possible uses for duplicate clips. I am working on such a project now, related to foreign language study, but I don’t presume this is common.

When a language’s unique prompts are depleted, could we provide some notice to the users of the critical need for new sentences, but continue recording those with the fewest duplicates? And maybe duplicate clips could always be at the bottom of the review queue? In other words, provide clear information about what’s going on, and always prioritize unique clips, but don’t completely block duplicates.

Thank you,

Craig

irvin · June 19, 2020, 6:21am

There are many STT models. Some of them, like DeepSpeech, benefit from having one clip per sentence, but others benefit from having as many clips as possible, even if sentences are repeated.

One of the research directions for Chinese STT recognition is to create “the smallest best corpus that includes all potential pronunciations” (about 1500 of them) and ask every participant to record all of them. This can be used to train the computer to learn how different people pronounce the same character. I have blogged about this approach to STT training, which can be applied to logographic and syllabary languages such as Chinese (and all of its various languages), Japanese, and others.

This is also what our sentence collection efforts for Mandarin (Taiwan) and Cantonese emphasize for now: ensuring that the corpus covers all of the characters and pronunciations (so far we have covered about 63.31% of pronunciations in the Taiwan Mandarin corpus).

This feature, which limits recordings to one clip per sentence, will restrict the use of the Common Voice database for this kind of research and development.

The ideal scenario is that we 1) make sure each sentence in the corpus has been recorded and validated at least once, 2) encourage people to record as much as possible while prioritizing new sentences when recording, and 3) generate a separate tsv with no repeated sentences to meet DeepSpeech’s requirement.

stergro · June 19, 2020, 7:47am

Generally I am okay with this approach, it could even be a chance to motivate people to collect more sentences. But three thoughts on this:

Please don’t delete existing duplicates.
One main problem I see is that people skip incorrect sentences or report them, but they do not disappear from the queue. When a language runs out of sentences, only these incorrect sentences will be left in the queue. So if you force small languages to have no repetitions, then you should also implement a process that really removes reported sentences from the list of recordable sentences instead of just writing them to a list that someone has to check manually.
There are already tons of very similar sentences in the dataset that could count as duplicates, for example sentences that differ in only one word, or sentences that end with a question mark instead of a full stop. The sentences from Wikipedia in particular often have a very similar structure.

lissyx · June 19, 2020, 9:31am

I can’t speak for the current status after this change, but this is what was done in previous Common Voice releases: validated.tsv contains much more than train.tsv, which makes it possible to regenerate the data with duplicated recordings.

Adrijaned · June 20, 2020, 11:30pm

I also have to say: please don’t just delete duplicate recordings of sentences. Move them to the end of the review queue and disallow re-recording them for all I care, but just deleting them feels disrespectful to the work of the volunteers who donated them, if nothing else. Especially since what’s best for the majority of uses isn’t best for all.

Mte90 · June 22, 2020, 9:28am

This is also true for us (Italy), where there are datasets of the same sentence read by different people from different regions, used to analyze the various accents.

This is kind of true again.

I agree totally!

If the focus is just “hey, this is a list of sentences with their audio”, there is no reason to discuss it.
If instead the discussion is about “creating the most complete list of recordings to maximize the volunteers’ effort”, we can talk about it.

Maximizing the volunteers’ effort would mean allowing multiple recordings of the same sentence, perhaps up to a specific limit, like 3.
I think this problem also exists in English, which is spoken by a lot of people who are not native speakers.

Anyway, thanks for the reminder that this project follows the Open by Design philosophy, because sometimes this is not clear or easy to see at Mozilla.

PS: maybe it is time to let volunteers see the list of reported sentences so they can improve the sentences themselves, if the project is working to improve overall quality. Otherwise it is like a “colander” where just one hole is closed and all the rest are left open.

phirework · June 22, 2020, 5:08pm

Hi all, thanks for the detailed and thoughtful discussion here. We are still refining our approach towards how to prioritize recording and validation in a way that maximizes the efficiency of our volunteers’ efforts, and this is merely one step in that long process. For now, I just wanted to clarify two things:

Each dataset we release includes all voice data that the platform has collected, including clips that have not yet made it through the validation queue and clips that have been rejected by the validation queue. Any repeated voice clips, even if they have not been validated by volunteers, will continue to be available in the dataset.
We do not ever delete data from contributors, except when individuals request deletion of their own personal contributions. We take the trust and time of our volunteers very seriously, and we completely agree that simply discarding hundreds of hours of effort would be incredibly disrespectful.

We are working on including a list of reported sentences in the next dataset release, which will hopefully give you all a better sense of how to further refine collection efforts, and we totally hear you on needing to integrate that into the validation queue logic. Thanks again for all your feedback.

lissyx · June 22, 2020, 5:21pm

Through validated.tsv as it is right now, is that right?

phirework · June 22, 2020, 5:25pm

Repetitions that have already been validated will be in validated.tsv, yes. Repetitions that have not received enough votes to be validated yet will be in other.tsv.

irvin · December 2, 2020, 5:51am

The restriction has been limiting some locales during campaigns and local promotion [e.g. *]. For now, we don’t have enough people working on bringing new sentences to the site quickly, and we are probably less bound by our own requirement of no repeated sentences in the database (a limit that comes from DeepSpeech).

Should we re-enable re-recording of sentences once they have all been recorded once?

It would be very easy for researchers to remove the duplicate sentences from the downloaded database, but it’s very hard for us to keep enough sentences online for people to record.
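To show how little work this is on the researcher’s side, here is a minimal sketch that keeps only the first clip of each sentence from a downloaded tsv. It assumes the file has a header row with a `sentence` column, as in the released Common Voice tsv files; the function names and the sample rows are made up for this example.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def drop_repeated_sentences(rows, text_column="sentence"):
    """Keep the first row seen for each sentence text, in file order."""
    seen = set()
    unique = []
    for row in rows:
        text = row[text_column]
        if text in seen:
            continue
        seen.add(text)
        unique.append(row)
    return unique


# Made-up sample standing in for a downloaded validated.tsv.
SAMPLE = (
    "client_id\tpath\tsentence\n"
    "A\tclip_a.mp3\tHello there.\n"
    "B\tclip_b.mp3\tHello there.\n"
    "A\tclip_c.mp3\tGoodbye.\n"
)

rows = read_tsv(io.StringIO(SAMPLE))
unique_rows = drop_repeated_sentences(rows)
```

For a real file, `read_tsv(open("validated.tsv", encoding="utf-8", newline=""))` would replace the `io.StringIO` sample.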

Time needed after adding sentences to the sentence collector to start using them?

baghdadi.mr · December 2, 2020, 9:20am

I fully agree with you.

irvin · December 5, 2020, 9:04am

Second case on Chuvash

I believe this restriction is harming the project. The more successful the local promotion, the harder the situation.

Imagine that one locale finally gets media coverage, many people read the story and come to Common Voice, and then find that recording is unavailable.

Considering the resources we currently have for bringing sentences to the website more frequently, it is even harder for both us and the local teams than it was in June.

irvin · December 5, 2020, 4:40pm

issue filed by spectie: https://github.com/mozilla/common-voice/issues/2948

mh5 · June 4, 2022, 11:20pm

The German dataset has too many duplicates, and some sentences repeat more than 20 times!

It is a waste of time for contributors to pronounce the same sentence more than 20 times.

For example:

536782: 93: common_voice_de_18497379.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
605869: 93: common_voice_de_18210288.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
623969: 93: common_voice_de_18138529.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
631232: 93: common_voice_de_18099979.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
648405: 93: common_voice_de_17993631.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
663098: 93: common_voice_de_17879514.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
674393: 93: common_voice_de_17814622.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
676425: 93: common_voice_de_17805078.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
683024: 93: common_voice_de_17779875.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
685180: 93: common_voice_de_17772181.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
696755: 93: common_voice_de_17717845.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
704072: 93: common_voice_de_17682161.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
709907: 93: common_voice_de_17663250.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
710988: 93: common_voice_de_17660577.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
713426: 93: common_voice_de_17655085.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
730375: 93: common_voice_de_17543721.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
734178: 93: common_voice_de_17504008.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
735969: 93: common_voice_de_17483698.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
740153: 93: common_voice_de_17429144.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
746123: 93: common_voice_de_17361912.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
750737: 93: common_voice_de_17342523.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
755862: 93: common_voice_de_17330584.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
765595: 93: common_voice_de_17299408.opus,Gerade zu Beginn des Studiums ist es wichtig, sich mit seinen Kommilitonen zu vernetzen..opus
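For anyone who wants to check this on their own download, a small sketch like the following lists the sentences recorded more than a given number of times. It assumes the rows have already been loaded as dictionaries with a `sentence` key (for example with `csv.DictReader`); the function names are invented for this example.

```python
from collections import Counter


def recordings_per_sentence(rows, text_column="sentence"):
    """Count how many clips exist for each sentence text."""
    return Counter(row[text_column] for row in rows)


def repeated_more_than(rows, threshold, text_column="sentence"):
    """Return (sentence, count) pairs with more than `threshold` clips,
    most frequently recorded first."""
    counts = recordings_per_sentence(rows, text_column)
    return [(text, n) for text, n in counts.most_common() if n > threshold]
```

Calling `repeated_more_than(rows, 20)` would list the sentences that have more than 20 clips.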

bozden · June 6, 2022, 1:51am

Re-recording happens when the text corpus falls behind. It might have happened in the past. At the very beginning of the project, a starter set of sentences was added, and those sentences might have been recorded multiple times. This is true for many datasets, although not to this extent…

For analysis, you can download all past datasets and check the distribution of recordings per sentence. As a simple calculation: if the average recording length is 3.6 seconds, you need 1000 distinct sentences for 1 hour of recordings (if you want a single recording per sentence). So, as German has 1000+ hours, it should have 1 million+ sentences in the text corpus (does it?).
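The same back-of-the-envelope calculation as a tiny helper, so it can be repeated for other languages. The 3.6 second average is the assumed figure from above, not a measured value, and the function name is made up.

```python
def sentences_needed(hours, avg_clip_seconds=3.6):
    """Distinct sentences needed to fill `hours` of audio when every
    sentence is recorded exactly once."""
    return hours * 3600 / avg_clip_seconds


per_hour = sentences_needed(1)         # about 1000 sentences
for_german = sentences_needed(1000)    # about 1 million sentences
```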

To prevent that, you have to feed the Sentence Collector with new CC0 sources. Common Voice gets “least recorded” sentences from the database and feeds them randomly to the new recording sessions. So it is not probable that these are re-recorded unless everything is recorded 20+ times.
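To make the “least recorded first, then random” idea concrete, here is a simplified sketch. It is not the actual Common Voice query; `pick_sentences`, `pool_size` and `batch_size` are hypothetical names chosen for illustration.

```python
import random


def pick_sentences(record_counts, batch_size, pool_size, rng=None):
    """Pick `batch_size` sentences at random from the `pool_size`
    least-recorded sentences.

    `record_counts` maps each sentence to the number of existing clips.
    """
    rng = rng or random.Random()
    # Sort by clip count so sentences with the fewest recordings come first.
    least_recorded = sorted(record_counts, key=lambda s: record_counts[s])[:pool_size]
    # Randomize within the pool so contributors do not all get the same order.
    return rng.sample(least_recorded, min(batch_size, len(least_recorded)))
```

With a selection like this, a sentence that already has many clips only comes up again once every other sentence has caught up, which is why heavy repetition points to a text corpus that is too small.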

As a side note: Although Common Voice CorporaCreator takes one recording per sentence for default splits, it might be better to create custom splits where you take 2 or 3 recordings per sentence (and train & test them). The model can improve with sentences spoken by different people (e.g. male/female, young/elderly, accents, etc).
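A custom split along these lines could be sketched as below. This is not how CorporaCreator works internally; it is only an illustration that keeps up to a fixed number of clips per sentence and requires each kept clip of a sentence to come from a different speaker. It assumes rows are dictionaries with `sentence` and `client_id` keys, and the function name is invented.

```python
from collections import defaultdict


def cap_recordings_per_sentence(
    rows, max_per_sentence=3, text_column="sentence", speaker_column="client_id"
):
    """Keep at most `max_per_sentence` clips per sentence, each from a
    different speaker, preserving the original row order."""
    kept = []
    speakers_seen = defaultdict(set)
    for row in rows:
        speakers = speakers_seen[row[text_column]]
        # Skip if the sentence is full or this speaker already read it.
        if len(speakers) >= max_per_sentence or row[speaker_column] in speakers:
            continue
        speakers.add(row[speaker_column])
        kept.append(row)
    return kept
```

Requiring distinct speakers is a design choice for this sketch: the extra clips only add variety (voice, age, accent) if they are not the same person repeating the sentence.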

irvin · June 6, 2022, 5:45pm

We learn how different people pronounce the same words through this. It would be a waste of time only if the sentences were recorded by the same people. Duplicate sentences may be useless for one model (such as DeepSpeech) but could be useful for another.

stergro · June 13, 2022, 10:41am

A good first step to motivate people to add more sentences would be to show the number of sentences and the ratio of recordings per sentence as a statistic on https://commonvoice.mozilla.org/en/languages

The number of sentences is already visible for the languages in the “on progress” section. I never understood why this number is missing in the “launched” section.

The number of sentences is also not part of the statistics of the Datasets on https://commonvoice.mozilla.org/en/datasets .

EDIT: oh nice, looks like this is being implemented now.