One of the most important parts of building a strong voice dataset is being able to provide people with enough sentences to read in their language. Without this, voice collection is not possible, and as a team we have put a lot of work into sentence collection since the launch of Common Voice.
To make the Common Voice dataset as useful as possible, we have decided to only allow source text that is available under a Creative Commons (CC0) license. Using the CC0 standard means it's more difficult to find and collect source text, but it allows anyone to use the resulting voice data without usage restrictions or authorization from Mozilla. Ultimately, we want to make the multi-language dataset as useful as possible to everyone, including researchers, universities, startups, governments, social purpose organizations, and hobbyists.
In the early days, sentence collection was an immature process. We accepted sentences through different channels (email, GitHub, Discourse…), which led to a heavy workload for staff and inconsistent quality checks, as well as additional work to clean up sentences that were not useful for the Deep Speech algorithm.
As a result, late last year we put in place a tool to centralize sentence collection and review, automating many of the quality checks and establishing a workflow that ensures peer review is done. We have already collected 424K sentences in 45 languages, of which 79K have already been validated and incorporated into the main site.
We have been hearing a lot of your feedback over the last few months and we acknowledge the limitations and difficulties this process has for some of you. We will keep working on improving the experience.
New approaches for a big challenge
Over the past months, we have also been exploring alternative ways to collect the volume of sentences needed to meet voice collection demand. If we want to reach an initial milestone of 2,000 validated hours for what we call a “minimum viable dataset”*, the math tells us that we’ll need at least 1.8M unique sentences (4s each on avg.) per language if we don’t want more than one recording of each sentence**: 2,000 hours is 7.2M seconds, and at 4s per recording that is 1.8M recordings. This is needed for Deep Speech model quality, as we noted in our H1 roadmap. Even with all the amazing community contributions we’ve seen, time beats us here.
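The arithmetic behind that estimate can be reproduced with a short sketch. The function name and parameters are illustrative only and are not part of any Common Voice tool; the inputs (2,000 hours, 4s average, one recording per sentence) come from the paragraph above, and the 10K-hour figure comes from the footnote at the end of this post.

```python
import math


def sentences_needed(target_hours: float,
                     avg_seconds_per_recording: float,
                     recordings_per_sentence: int = 1) -> int:
    """Estimate how many unique sentences a language needs.

    Hypothetical helper: converts the target hours to seconds, divides by
    the average recording length, then by the number of recordings we are
    willing to accept for each sentence.
    """
    total_seconds = target_hours * 3600
    recordings = total_seconds / avg_seconds_per_recording
    return math.ceil(recordings / recordings_per_sentence)


# Minimum viable dataset: 2,000 validated hours, 4s per recording.
minimum_viable = sentences_needed(2000, 4)

# Upper-bound estimate for a production system: 10,000 hours.
production_upper_bound = sentences_needed(10000, 4)

print(minimum_viable, production_upper_bound)
```

Under the same assumptions, the 10K-hour upper bound would require about 9M unique sentences per language, which shows why manual submission alone cannot keep up.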
That’s why we’ve been looking into other big sources of sentences out there, and I’m happy to announce that our investigations and legal counsel have given us a legitimate way to tap into one of the biggest and most important sources of information of our time. We are able to use sentences from Wikipedia as long as we don’t extract more than 3 random sentences per article. This way, we can use these sentences in the project under fair-use copyright provisions. In case you’re wondering, we’ve let Wikimedia know about this.
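To illustrate the constraint, here is a minimal sketch of picking at most 3 random sentences from one article. It is not the actual extraction tool: the naive regex sentence splitter and the names `split_sentences` and `pick_sentences` are assumptions made for this example, and real Wikipedia text needs far more careful handling.

```python
import random
import re

# Limit stated in this post: no more than 3 random sentences per article.
MAX_SENTENCES_PER_ARTICLE = 3


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def pick_sentences(article_text: str,
                   rng: random.Random,
                   limit: int = MAX_SENTENCES_PER_ARTICLE) -> list[str]:
    """Return up to `limit` distinct sentences chosen at random from one article."""
    sentences = split_sentences(article_text)
    return rng.sample(sentences, min(limit, len(sentences)))


article = (
    "The river rises in the northern hills. It flows south for many miles. "
    "Several towns lie along its banks. Fishing is common there. "
    "The river finally reaches the sea."
)
picked = pick_sentences(article, random.Random(0))
print(picked)
```

Passing a seeded `random.Random` keeps a run reproducible, which makes it easier to audit which sentences were taken from which article.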
We know Wikipedia won’t be able to cover our needs for all languages, so we want to expand our investigation and provide communities with the resources to do the same with other sources. Additionally, we will keep supporting the community-driven sentence collection approach as an important complementary way to submit, review and import sentences that come from other small sources or have been manually submitted.
This new approach unlocks a potentially huge source of already reviewed sentences. Over the past weeks we have been working on a way to automate this work for Chinese Mandarin (in Simplified characters) and English, building on the validation rules the sentence collector developed through community feedback. We have been writing per-language rules to make sure the extraction has an error rate below 5-10% and that sentences are complete and readable. We are also working on ways for contributors to flag any issue with sentences displayed on the site. We will provide more details on quality control in upcoming communications.
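As a rough idea of what a per-language rule set could look like, here is a hypothetical sketch. The rule fields, thresholds and the `is_acceptable` function are invented for illustration and do not describe the real extraction script; the word-count check also assumes a whitespace-delimited language such as English, so Mandarin would need different rules.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageRules:
    """Hypothetical rule set; the real tool's rules may look very different."""
    min_words: int
    max_words: int
    disallowed: re.Pattern  # characters that make a sentence hard to read aloud


# Illustrative thresholds only, not the project's actual values.
ENGLISH_RULES = LanguageRules(
    min_words=3,
    max_words=14,
    disallowed=re.compile(r"[0-9@#<>\[\]{}]"),
)


def is_acceptable(sentence: str, rules: LanguageRules) -> bool:
    """Check that a sentence looks complete and readable under the given rules."""
    sentence = sentence.strip()
    words = sentence.split()
    if not rules.min_words <= len(words) <= rules.max_words:
        return False
    if rules.disallowed.search(sentence):
        return False
    # Treat a sentence as complete if it starts with an uppercase letter
    # and ends with terminal punctuation.
    return bool(sentence) and sentence[0].isupper() and sentence[-1] in ".!?"


candidates = [
    "The quick brown fox jumps.",
    "It was founded in 1999.",
    "Too short.",
    "the fox jumps over it.",
]
accepted = [s for s in candidates if is_acceptable(s, ENGLISH_RULES)]
print(accepted)
```

Keeping the rules as data separate from the checking function is one way to let each community tune thresholds for its own language without touching the extraction logic.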
Our plan for the future is to work with communities to help create more per-language rules. This way we will be able to extract, in one pass, the number of sentences needed for voice collection. We also plan to extend the extraction script to support per-source rules, allowing us to plug in other big sources of text where we can legally do the same.
Please keep an eye on this channel. We will be engaging with communities in the coming weeks with more information.
Ruben, on behalf of the Common Voice Team
*the amount of data required to train an algorithm to hit acceptable quality for it to be considered trained and minimally functioning. In the case of DeepSpeech, the minimum viable dataset is 2K hours of validated speech with its accompanying text. To train a production speech-to-text system we estimate the upper bound of hours required is 10K.
**noting English has a head-start with other data sources, but this is the goal for the Common Voice data, which has the unique characteristic of a lot of speaker diversity
Update (July 24th 2019): The wiki extractor tool is now ready for technical testing.