Single Sentence Record Limit feature release

In keeping with our commitment to improving our dataset’s quality, last week we launched a new feature to reduce data repetition.

The problem

Common Voice previously allowed people to record the same sentences once the queue of new sentences was exhausted, resulting in data repetition. Research has shown that the majority of people using the Common Voice dataset prefer one recording per sentence when training models: it yields lower word error rates than training on redundant clips.

The solution

As of last week, the Common Voice platform is beginning to limit recordings to one validated clip per sentence across all languages. This is the first implementation step for this feature and will ensure that recording repetitions are phased out as voice contribution enters the second half of 2020.

To minimize disruption to the contributor experience, we have decided to gradually backfill the data needed to make this determination. Updates will be triggered every time there is a new recording or a new validation. This means that you may see a divergence in clips recorded compared to clips available to be validated each day. The act of recording or validating a sentence that already has a validated clip will remove all other clips for this sentence from the validation queue.
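The queue rule described above could be sketched roughly like this (the function and its arguments are hypothetical illustrations, not the actual platform code):

```python
# Hypothetical sketch of the rule: once a sentence has one validated
# clip, all of its other clips leave the validation queue.
def update_validation_queue(queue, sentence_id, has_validated_clip):
    """queue: list of (clip_id, sentence_id) pairs awaiting validation.

    Returns the queue unchanged if the sentence has no validated clip
    yet; otherwise returns the queue with every clip for that sentence
    removed.
    """
    if not has_validated_clip:
        return queue
    return [(clip, sent) for clip, sent in queue if sent != sentence_id]
```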

For some languages, this will be less noticeable than others, depending on each language community’s contribution cadence and the proportion of new sentences available for recording. We are working on an interim migration that will address the languages with the least amount of new sentences available. In the meantime, if you are noticing a significant discrepancy, that is an indicator that your language is running low on sentences to record, and we encourage you to refer to this post to see how you can help bring more sentences to the platform in your language.

The team is also aware that not all languages have a large enough contributor base to produce the 2,000 valid voice hours needed to train an STT model. We are planning on creating exemptions to this limit for languages with a smaller contributor population where it’s not realistic to achieve 2,000 valid voice hours, and at least 1k speakers, in the near future. This will allow languages to create datasets that mature in both size and quality as contribution grows. We’ll share more details on this exemption when it is ready.

We have worked closely with our Deep Speech colleagues and various machine learning experts to confirm that limiting sentence and clip repetition is the right approach for improving the quality of our data.

Common Voice follows the principles of Open by Design and all the technical details regarding the implementation of this solution can be found in this pull request on GitHub.

If you have any questions, comments, or suggestions, please reply to this post.

Thank you!

Christos and the Common Voice team


And what happens to these clips? Will they never be validated and never end up in the dataset?

It seemed to me that the dataset could be used for many projects, not just the creation of an STT model. Some may be interested in several clips of the same text. Others might be interested only in female voice clips.

On the French version, one man recorded several tens of thousands of clips. If the system privileges his clips, just because he is first in the queue, rather than those of speakers who have recorded only a small number of clips, we lose in diversity.

It would therefore be preferable to offer a dataset with absolutely all valid clips, and to sort afterwards, according to the needs of each project.

Thanks for opening this subject for discussion.

I agree that it’s important to do everything possible to maximize diversity of prompts and speakers, but as Okki said, there are indeed some possible uses for duplicate clips. I am working on such a project now, related to foreign language study, but I don’t presume this is common.

When a language’s unique prompts are depleted, could we provide some notice to the users of the critical need for new sentences, but continue recording those with the fewest duplicates? And maybe duplicate clips could always be at the bottom of the review queue? In other words, provide clear information about what’s going on, and always prioritize unique clips, but don’t completely block duplicates.

Thank you,

Craig


There are many STT models. Some of them, like DeepSpeech, benefit from one clip per sentence, but others benefit from as many clips as possible, regardless of repetition.

One of the research directions for Chinese STT recognition is to create “the smallest, best corpus that includes all potential pronunciations” (there are about 1,500 of them), and ask every participant to record all of them. This can be used to train the computer to learn how different people pronounce the same character. I have blogged about this approach to STT training, which can be applied to logogram and syllabary languages such as Chinese (and all of its various languages), Japanese, and others.

That is also what our sentence-collecting efforts for Mandarin (Taiwan) and Cantonese emphasize for now: ensuring the corpus covers all of the characters and pronunciations (so far we have covered about 63.31% of the pronunciations in the Taiwan Mandarin corpus).
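As a toy illustration of how such coverage can be measured (the syllable sets below are invented; a real pipeline would derive syllables with a grapheme-to-pronunciation tool such as pypinyin for Mandarin):

```python
def coverage(target_syllables, corpus_syllable_lists):
    """Fraction of a target pronunciation inventory covered by a corpus.

    target_syllables: the full set of pronunciations to cover.
    corpus_syllable_lists: one list of syllables per corpus sentence,
    assumed to be pre-computed by some pronunciation tool.
    """
    covered = set()
    for syllables in corpus_syllable_lists:
        covered.update(syllables)
    covered &= set(target_syllables)
    return len(covered) / len(target_syllables)
```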

This feature limiting clips to one per sentence will restrict the Common Voice database for this kind of research and development.

The ideal scenario is that we 1) make sure each sentence in the corpus has been recorded and validated at least once, 2) encourage people to record as much as possible while prioritizing new sentences when recording, and 3) extract a special TSV for DeepSpeech’s requirement that contains no repeated sentences.
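For point 3, a minimal sketch of extracting a no-repeats TSV from a release file, assuming the usual Common Voice column layout with a `sentence` field (file names here are illustrative):

```python
import csv

def dedupe_tsv(src="validated.tsv", dst="validated_unique.tsv"):
    """Keep only the first clip per sentence, writing a TSV with no
    repeated sentences (one clip per sentence, DeepSpeech-style)."""
    seen = set()
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames,
                                delimiter="\t")
        writer.writeheader()
        for row in reader:
            if row["sentence"] not in seen:
                seen.add(row["sentence"])
                writer.writerow(row)
```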


Generally I am okay with this approach, it could even be a chance to motivate people to collect more sentences. But three thoughts on this:

  • Please don’t delete existing duplicates
  • One main problem I see is that people skip incorrect sentences or report them, but those sentences do not disappear from the queue. When a language runs out of sentences, only these incorrect sentences will be left in the queue. So if you force a small language to have no repetitions, you should also implement a process that actually removes reported sentences from the list of recordable sentences, instead of just writing them to a list that someone has to check manually.
  • There are tons of very similar sentences in the dataset already that could count as duplicates. For example, sentences that differ in only one word, or sentences that end with a question mark instead of a full stop. The sentences from Wikipedia in particular often have a very similar structure.

I can’t speak for the current status after this change, but this is what was done in previous Common Voice releases: validated.tsv contains much more than train.tsv, allowing one to regenerate the data with duplicated recordings.

I also have to ask that you not simply delete duplicate recordings of sentences. Move them to the end of the review queue and disallow their re-recording for all I care, but just deleting them feels disrespectful to the work of the volunteers who donated them, if nothing else. Especially since what’s best for the majority of uses isn’t best for all.

This is also true for us (Italy), where datasets exist of the same sentence read by different people from different regions, used to analyze the various accents.

This is kind of true again.

I agree totally!

If the focus is just “here is a list of sentences with their audio,” there is no reason to discuss it.
If instead the discussion is about “creating the most complete set of recordings to maximize volunteers’ effort,” then we can talk about it.

Maximizing volunteers’ effort would mean allowing multiple recordings of the same sentence, perhaps up to a specific ratio, like 3.
I think this problem also exists in English, which is spoken by many people who are not native speakers.

Anyway, thanks for the reminder that this project follows the Open by Design philosophy, because sometimes this is not clear or easy to see at Mozilla.

PS: maybe it is time to let volunteers see the list of reported sentences, so that the sentences themselves can be improved. If the project is working to improve quality overall, anything less is like a colander where just one hole is closed and all the rest remain open.

Hi all, thanks for the detailed and thoughtful discussion here. We are still refining our approach towards how to prioritize recording and validation in a way that maximizes the efficiency of our volunteers’ efforts, and this is merely one step in that long process. For now, I just wanted to clarify two things:

  • Each dataset we release includes all voice data that the platform has collected, including clips that have not yet made it through the validation queue and clips that have been rejected by the validation queue. Any repeated voice clips, even if they have not been validated by volunteers, will continue to be available in the dataset.
  • We do not ever delete data from contributors, except when individuals request deletion of their own personal contributions. We take the trust and time of our volunteers very seriously, and we completely agree that simply discarding hundreds of hours of effort would be incredibly disrespectful.

We are working on including a list of reported sentences in the next dataset release, which will hopefully give you all a better sense of how to further refine collection efforts, and we totally hear you on needing to integrate that into the validation queue logic. Thanks again for all your feedback.


Through validated.tsv as it is right now, is that right?

Repetitions that have already been validated will be in validated.tsv, yes. Repetitions that have not received enough votes to be validated yet will be in other.tsv.
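For anyone who wants to recover the repetitions from a release, here is a small sketch that counts clips per sentence across the release files, assuming each TSV has a `sentence` column as in past releases:

```python
import csv
from collections import defaultdict

def clips_per_sentence(*tsv_paths):
    """Count clips per sentence across release files such as
    validated.tsv and other.tsv, returning only repeated sentences."""
    counts = defaultdict(int)
    for path in tsv_paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                counts[row["sentence"]] += 1
    return {s: n for s, n in counts.items() if n > 1}
```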
