It’s halfway through 2020, and we are excited to share all the great things happening for Common Voice! This spring we collected our first target segment and released it as part of the 2020 Mid-Year Dataset Release. At the same time, we continued to tune our new infrastructure to ensure the best contribution experience possible, especially at times of high visitor volume. Moving into the second half of the year, we’re preparing for a domain migration and kicking off planning for our next steps in improving dataset quality and access.
But first, the details…
Single Word Target Segment
This May, Common Voice experimented with collecting its first target segment of voice data. This single word target segment included the numbers zero through nine, the words “yes” and “no”, as well as “Hey” and “Firefox”, and has been an important step for the Common Voice project in experimenting with more nuance in data collection. This segment creates use-case-specific benchmarks against which the Common Voice platform can be tested for the quality of its data collection.
The single word target segment is now available for download as part of the 2020 Mid-Year Dataset Release.
This initial experiment ran for a limited collection window. In the second half of 2020, the team will work to introduce more target segments for collection geared toward specific use cases. We’re also working to enable opt-in or opt-out capability for these future segments, with the added benefit of re-enabling collection of the single word segment in the future. Not only do these segments make our data more comprehensive and actionable, they are vital to giving people around the world better tools for innovation.
Single Sentence Record Limit
This spring, the Common Voice platform began to limit voice recordings to one valid clip per sentence, across all languages. This decision was made in concert with our Deep Speech colleagues and various machine learning experts, who confirmed that limiting sentence and clip repetition is the right approach for improving the quality of our data. We recognize that some smaller language communities may have a harder time collecting ~2k hours of valid voice data with only unique sentences, and so we have implemented exemptions to the single sentence record limit for any languages that have fewer than 500,000 speakers globally. We understand that this is an evolving conversation, and welcome additional feedback from you on how we can iterate on this feature. We will also continue to investigate ways to include a greater diversity of data for those use cases where repetition is helpful (e.g. the single word segment, which is intentionally based on repetition).
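For anyone who finds a rule easier to read as code, here is a minimal sketch of the limit and its exemption as described above. It is an illustration only, not the platform's actual implementation: the function names are hypothetical, and treating exempt languages as having no per-sentence cap is our assumption, since the exact behavior for exempt languages is not spelled out here.

```python
# Illustrative sketch only; names and the "no cap when exempt" behavior are assumptions.
SPEAKER_THRESHOLD = 500_000  # languages with fewer global speakers are exempt
VALID_CLIPS_PER_SENTENCE = 1  # the single sentence record limit


def is_exempt(global_speakers: int) -> bool:
    """True when a language falls under the small-language exemption."""
    return global_speakers < SPEAKER_THRESHOLD


def accepts_new_recording(valid_clips_for_sentence: int, global_speakers: int) -> bool:
    """Decide whether a sentence may still be offered for recording."""
    if is_exempt(global_speakers):
        # Assumption: exempt languages may keep collecting repeats.
        return True
    return valid_clips_for_sentence < VALID_CLIPS_PER_SENTENCE
```

Note that the threshold is strict: under this reading, a language with exactly 500,000 speakers is not exempt.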
2020 Mid-Year Dataset Release
Common Voice was created to build an open voice dataset that is publicly available to everyone. As a result of our community’s hard work and amazing engagement, at the end of June we released the 2020 Mid-Year Dataset (version 5.0)*. This latest release features 7,226 total hours of voice data, including 120 hours of the single word target segment. Read more specifics about that release here.
To provide more context around our datasets, as well as a central point of feedback, we’ve created a GitHub repo for dataset versioning. The changelog and stats for the last four dataset releases are currently available, and over the coming weeks we will be working to backfill even older metadata. We are also working to enable access to older datasets - stay tuned!
*In the weeks following release it came to our attention that version 5.0 unintentionally altered the column order of the test / train / dev TSV files and included some redundant metadata entries for clips that didn’t actually have valid audio. Version 5.1 has been released and is now available for download.
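A change in column order like the one described above only breaks code that reads TSV fields by position. As a hedged sketch, reading by header name with Python's standard csv module sidesteps the problem. The column names used here (client_id, path, sentence) and the helper name read_clips are assumptions for illustration; check the header row of the files you download.

```python
import csv
import io


def read_clips(tsv_file, columns=("path", "sentence")):
    """Return one dict per row, selecting fields by header name, not position."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Fail early if an expected column is absent from the header row.
    missing = [c for c in columns if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [{c: row[c] for c in columns} for row in reader]


# Two made-up files with the same data but a different column order.
order_a = "client_id\tpath\tsentence\nabc\tclip_1.mp3\tHey\n"
order_b = "path\tsentence\tclient_id\nclip_1.mp3\tHey\tabc\n"

rows_a = read_clips(io.StringIO(order_a))
rows_b = read_clips(io.StringIO(order_b))
```

Both calls produce the same rows, so a loader written this way would give the same result regardless of how the columns are ordered.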
Infrastructure Improvements
After all the infrastructure work and upgrades accomplished in Q1, the target segment collection effort was an excellent way to stress-test our improvements. We ran a “snippets” campaign on Firefox for the single word target segment that directed nearly 10x the usual traffic to the Common Voice platform. While the traffic from previous campaigns led to site-wide downtime, our new infrastructure kept the site up and running through this recent campaign. We did, however, experience slightly longer response times, especially for validation and dashboard stats, and the additional traffic load exposed more places where we can fine-tune our site performance. Stay tuned for more updates!
Streamlining Community Channels for Feedback and Feature Requests
Earlier this spring we began streamlining our community channels to better support our contributors. The new support channels are as follows…
- All technical issues about the platform should be submitted to the platform repo on GitHub.
- All technical issues about the dataset should be submitted to the dataset repo on GitHub.
- Feature requests, strategic discussions and other dataset conversations will stay on Discourse.
- Other issues reported through the platform or other inquiries will be submitted via email.
For more information, revisit our Q1 community update.
In addition, we are aligned with, and will be working on, all the recommendations the Support team at Mozilla (SUMO) made for Common Voice last month. Some of these core ideas include:
- Identifying recurring issues and updating our FAQ with more relevant common replies.
- Having a predefined template on GitHub to simplify the triage process.
- Hosting developer documentation on the Mozilla Developer Network website.
Common Voice’s New Domain
Common Voice is getting a new home! On July 28, 2020, Common Voice will officially move to commonvoice.mozilla.org.
The current domain at voice.mozilla.org will instead become the foundational home for a brand new and exciting project, an all-encompassing Mozilla Voice developer tools program.
As part of a larger voice strategy, Mozilla is building voice developer tools that drive adoption of our machine-learning-based voice technologies. Currently there are many different sources of information for Mozilla Voice developer tools, and it’s not clear which tools are mature enough to be used, how to engage with our tech stack, and how our tools complement one another. The website voice.mozilla.org will eventually serve as the main access point to our developer tools and tell a more cohesive Mozilla Voice story. Until this new site is ready, voice.mozilla.org will redirect all traffic to commonvoice.mozilla.org.
To everyone who has been contributing to Common Voice, once again, thank you! Your hard work, creativity and support are noticed and valued. We know we wouldn’t be able to achieve as much as we do without our community. Over the next few weeks, Common Voice will be updating our product roadmap and setting new goals for the rest of the year. We are so excited to discover what we can achieve together in the second half of 2020!
Cheers,
Christos + the Common Voice team