In keeping with our commitment to improving the quality of our dataset, last week we launched a new feature to reduce data repetition.
The problem
Common Voice previously allowed people to record the same sentences again once the queue of new sentences was exhausted, resulting in data repetition. Research has shown that the majority of people using the Common Voice dataset prefer one recording per sentence when training models. Doing so yields lower word error rates than training on redundant clips.
The solution
As of last week, the Common Voice platform is beginning to limit recordings to one validated clip per sentence across all languages. This is the first implementation step for this feature and will ensure that recording repetitions are phased out as voice contribution enters the second half of 2020.
To minimize disruption to the contributor experience, we have decided to gradually backfill the data needed to make this determination. Updates will be triggered every time there is a new recording or a new validation. This means that each day you may see a gap between the number of clips recorded and the number of clips available to validate. Recording or validating a sentence that already has a validated clip will remove all other clips for that sentence from the validation queue.
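For readers who find it easier to follow in code, here is a minimal sketch of the rule described above. It is not the platform's actual implementation (that lives in the pull request mentioned below). The names `Clip`, `ValidationQueue`, `record`, and `validate` are hypothetical, the storage is a plain in-memory dictionary, and we assume the check for an existing validated clip runs right after each recording or validation event is applied.

```python
from dataclasses import dataclass, field


@dataclass
class Clip:
    clip_id: int
    sentence_id: int
    is_validated: bool = False


@dataclass
class ValidationQueue:
    """Hypothetical in-memory stand-in for the platform's clip storage."""

    clips: dict[int, Clip] = field(default_factory=dict)
    pending: set[int] = field(default_factory=set)  # clip ids awaiting validation

    def _has_validated_clip(self, sentence_id: int) -> bool:
        return any(
            c.sentence_id == sentence_id and c.is_validated
            for c in self.clips.values()
        )

    def _prune(self, sentence_id: int) -> None:
        # Assumption: the check runs after the triggering event is applied.
        # If the sentence has a validated clip, every other clip for that
        # sentence leaves the validation queue.
        if not self._has_validated_clip(sentence_id):
            return
        for c in self.clips.values():
            if c.sentence_id == sentence_id:
                self.pending.discard(c.clip_id)

    def record(self, clip_id: int, sentence_id: int) -> None:
        # A new recording is one of the two events that trigger an update.
        self.clips[clip_id] = Clip(clip_id, sentence_id)
        self.pending.add(clip_id)
        self._prune(sentence_id)

    def validate(self, clip_id: int) -> None:
        # A new validation is the other triggering event.
        clip = self.clips[clip_id]
        clip.is_validated = True
        self.pending.discard(clip_id)
        self._prune(clip.sentence_id)
```

In this sketch, older sentences keep their extra clips in the queue until someone records or validates one of them, which is why the cleanup happens gradually rather than all at once.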
For some languages, this will be less noticeable than for others, depending on each language community’s contribution cadence and the proportion of new sentences available for recording. We are working on an interim migration that will address the languages with the fewest new sentences available. In the meantime, if you notice a significant discrepancy, that is an indicator that your language is running low on sentences to record, and we encourage you to refer to this post to see how you can help bring more sentences to the platform in your language.
The team is also aware that not all languages have a large enough contributor base to produce the 2,000 valid voice hours needed to train an STT model. We are planning to create exemptions to this limit for languages with smaller contributor populations, where it is not realistic to reach 2,000 valid voice hours and at least 1,000 speakers in the near future. This will allow those languages to create datasets that mature in both size and quality as contribution grows. We’ll share more details on this exemption when it is ready.
We have worked closely with our Deep Speech colleagues and various machine learning experts to confirm that limiting sentence and clip repetition is the right approach for improving the quality of our data.
Common Voice follows the principles of Open by Design, and all the technical details of how this solution is implemented can be found in this pull request on GitHub.
If you have any questions, comments, or suggestions, please reply to this post.
Thank you!
Christos and the Common Voice team