Common Voice Quarter 2 2020 Community Update

Christos · July 21, 2020, 3:08pm

It’s halfway through 2020, and we are excited to share all the great things happening for Common Voice! This spring we collected our first target segment and released it as part of the 2020 Mid-Year Dataset Release. Simultaneously we continued to tune our new infrastructure to ensure the best contribution experience possible, especially at times of high visitor volume. Moving into the second half of the year, we’re preparing for a domain migration and kicking off planning for our next steps in improving dataset quality and access.

But first, the details…

Single Word Target Segment

This May, Common Voice experimented with collecting its first target segment of voice data. This single word target segment included the numbers zero through nine, the words “yes” and “no”, as well as “Hey” and “Firefox”, and has been a very important step for the Common Voice project in experimenting with more nuance in data collection. This segment creates use-case-specific benchmarks against which the quality of the Common Voice platform’s data collection can be tested.

The single word target segment is now available for download as part of the 2020 Mid-Year Dataset Release.

This initial experiment was limited in collection time and duration. In the second part of 2020, the team will work to introduce more target segments for collection geared toward specific use cases. We’re also working to enable opt-in or opt-out capability for these future segments, with the added benefit of re-enabling collection of the single word segment in the future. Not only do these segments make our data more comprehensive and actionable, they are vital to giving people around the world better tools for innovation.

Single Sentence Record Limit

This spring, the Common Voice platform began to limit voice recordings to one valid clip per sentence, across all languages. This decision was made in concert with our Deep Speech colleagues and various machine learning experts, who confirmed that limiting sentence and clip repetition is the right approach for improving the quality of our data. We recognize that some smaller language communities may have a harder time collecting ~2k hours of valid voice data with only unique sentences, and so we have implemented exemptions to the single sentence record limit for any languages that have fewer than 500,000 speakers globally. We understand that this is an evolving conversation, and welcome additional feedback from you on how we can iterate on this feature. We will also continue to investigate ways to include a greater diversity of data for those use cases where repetition is helpful (e.g. the single word segment, which is intentionally based on repetition).
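
As a rough illustration of the rule described above, the check can be sketched in Python. This is a minimal sketch under stated assumptions, not the platform's actual implementation: the function name, its parameters, and the way speaker counts are supplied are all hypothetical. Only the one-valid-clip-per-sentence limit and the 500,000-speaker threshold come from this update.

```python
# Hypothetical sketch of the single sentence record limit; not the platform's code.
SPEAKER_EXEMPTION_THRESHOLD = 500_000  # threshold stated in this update


def may_record(valid_clips_for_sentence: int, global_speakers: int) -> bool:
    """Return True if a sentence may still be offered for recording."""
    # Languages with fewer than 500,000 speakers globally are exempt from the limit.
    if global_speakers < SPEAKER_EXEMPTION_THRESHOLD:
        return True
    # Otherwise each sentence is limited to one valid clip.
    return valid_clips_for_sentence < 1
```

Under these assumptions, a language with 750,000 speakers would not be exempt, so a sentence that already has one valid clip would no longer be offered.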

2020 Mid-Year Dataset Release

Common Voice was created to build an open voice dataset that is publicly available to everyone. As a result of our community’s hard work and amazing engagement, at the end of June we released the 2020 Mid-Year Dataset (version 5.0)*. This latest release features 7,226 total hours of voice data, including 120 hours of the single word target segment. Read more specifics about that release here.

In order to provide more context around our datasets, as well as to provide a central point of feedback, we’ve created a GitHub repo for dataset versioning. The changelog and stats for the last four dataset releases are currently available, and over the coming weeks we will be working to backfill even older metadata. We are also working to enable access to older datasets - stay tuned!

*In the weeks following release it came to our attention that version 5.0 unintentionally altered the column order of the test / train / dev TSV files and included some redundant metadata entries for clips that didn’t actually have valid audio. Version 5.1 has been released and is now available for download.
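
For anyone parsing these files, one way to stay robust to a change in column order is to read fields by header name rather than by position. The snippet below is a minimal sketch using Python's standard csv module. The column names `path` and `sentence` and the sample rows are assumptions made for illustration, so check them against the header of the files you download.

```python
import csv
import io


def read_rows(tsv_text: str, wanted: list[str]) -> list[dict[str, str]]:
    """Read a TSV by header name so that column order does not matter."""
    # QUOTE_NONE keeps quote characters inside sentences as literal text.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    missing = [name for name in wanted if name not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return [{name: row[name] for name in wanted} for row in reader]


# Two made-up files with the same data but a different column order.
old_order = "path\tsentence\nclip1.mp3\tHey\n"
new_order = "sentence\tpath\nHey\tclip1.mp3\n"
```

Calling `read_rows` on either sample with `["path", "sentence"]` yields the same rows, which is the point of addressing columns by name.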

Infrastructure Improvements

After all the infrastructure work and upgrades accomplished in Q1, the target segment collection effort was an excellent way to stress-test our improvements. We ran a “snippets” campaign on Firefox for the single word target segment that directed nearly 10x traffic to the Common Voice platform. While the traffic from previous campaigns led to site-wide downtime, our new infrastructure kept the site up and running through this recent campaign. We did, however, experience slightly longer response times (especially for validation and dashboard stats), and the additional traffic load exposed more places to fine-tune our site performance. Stay tuned for more updates!

Streamlining Community Channels for Feedback and Feature Requests

Earlier this spring we began streamlining our community channels to better support our contributors. The new support channels are as follows…

All technical issues about the platform should be submitted to the platform repo on GitHub.
All technical issues about the dataset should be submitted to the dataset repo on GitHub.
Feature requests, strategic discussions and other dataset conversations will stay on Discourse.
Other issues reported through the platform, and any other inquiries, should be submitted via email.

For more information, revisit our Q1 community update.

In addition, we are aligned with, and will be working on, all the recommendations that the Support team at Mozilla (SUMO) made for Common Voice last month. Some of these core ideas include:

Identifying recurring issues and updating our FAQ with more relevant common replies.
Having a predefined template on GitHub to simplify the triage process.
Hosting developer documentation on the Mozilla Developer Network website.

Common Voice’s New Domain

Common Voice is getting a new home! On July 28, 2020, Common Voice will officially move to commonvoice.mozilla.org.

The current domain at voice.mozilla.org will instead become the foundational home for a brand new and exciting project, an all-encompassing Mozilla Voice developer tools program.

As part of a larger voice strategy, Mozilla is building voice developer tools that drive adoption of our machine learning based voice technologies. Currently there are many different sources of information for Mozilla Voice developer tools and it’s not clear which tools are mature enough to be used, how to engage with our tech stack, and how our tools complement one another. The website voice.mozilla.org will eventually serve as the main access point to our developer tools and tell a more cohesive Mozilla Voice story. Until this new site is ready, voice.mozilla.org will be redirecting all traffic to commonvoice.mozilla.org.
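
The redirect itself is a server-side concern, but its intended effect on existing links can be sketched as follows. This is a hypothetical illustration, assuming the redirect swaps only the host and keeps the path and query string unchanged. The function name is invented here, and the sketch says nothing about how the redirect is actually configured.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_HOST = "voice.mozilla.org"
NEW_HOST = "commonvoice.mozilla.org"


def redirect_target(url: str) -> str:
    """Map an old-domain URL to the new domain, keeping path and query."""
    parts = urlsplit(url)
    if parts.hostname != OLD_HOST:
        return url  # not an old-domain link, so leave it untouched
    return urlunsplit(
        (parts.scheme, NEW_HOST, parts.path, parts.query, parts.fragment)
    )
```

A helper like this could be used to check or bulk-update links in existing promo material, for example mapping a localized link such as voice.mozilla.org/nl to its counterpart on the new domain.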

To everyone who has been contributing to Common Voice, once again, thank you! Your hard work, creativity and support are noticed and valued. We know we wouldn’t be able to achieve as much as we do without our community. Over the next few weeks, Common Voice will be updating our product roadmap and setting new goals for the rest of the year. We are so excited to discover what we can achieve together in the second half of 2020!

Cheers,
Christos + the Common Voice team

lissyx · July 21, 2020, 3:22pm

Can we ensure this is properly highlighted to people? Maybe a banner?

mbranson · July 21, 2020, 5:30pm

@lissyx this will be an automatic redirect until the new site content is live at voice.mozilla.org. Once that site content is live, yes there will be a prominent banner displayed.

irvin · July 22, 2020, 3:02am

This link seems to point to the wrong place.

stergro · July 22, 2020, 7:51am

Hmm, I understand why you are doing this. But this will lead to tons of dead/wrong links from old articles, Reddit threads, …

Please add at least a banner or a message about Common Voice to this new page for a year or so; it will take a very long time until people stop looking for CV under this domain.

This also means that I will have to change a few banners and other promo material. When will you activate commonvoice.mozilla.org?

Fjoerfoks · July 22, 2020, 9:16am

Not too happy either with the domain change, since we made lots of promo material.
Will localized links, like voice.mozilla.org/nl also properly be redirected to commonvoice.mozilla.org/nl?

Being/feeling responsible for Frisian, I don’t feel comfortable with the single sentence record limit exemption applying only to languages with fewer than 500,000 speakers. ~2k hours seems a lot to me, and Frisian is listed with ~750,000 speakers. Maybe a threshold of a million speakers would be more appropriate. Can this be reconsidered?

lissyx · July 22, 2020, 9:26am

I agree this is going to be painful given all the existing communication material (we have the same issue for French, and it’s obviously a widespread issue for all locales), but as @mbranson said, the transition will try to make that as smooth as possible (and there is no reason why localized links should not work), and we can take the opportunity to generate even more publicity for the project.

alberto · July 22, 2020, 11:52am

Yes, the redirection will preserve the path.

Christos · July 22, 2020, 2:18pm

Thanks Irvin! I just updated to the correct one.

Thank you all (@stergro, @Fjoerfoks, @lissyx) for your comments and concerns regarding the domain change.
The new domain is planned to go live a week from now, on the 28th of July. @alberto and @phirework are making sure that all the hard links like voice.mozilla.org/nl will be redirected to commonvoice.mozilla.org/nl.

The Voice Developer tools website won’t go live before the end of August. That gives us a lot of time to ensure a smooth transition by informing our community through our channels and with a banner on our landing page.

Our goal is to achieve the smoothest transition possible, one that won’t affect our contributors’ experience at all, other than the huge banner informing them of the new domain.

@Fjoerfoks I am updating this comment since I didn’t realize that we have already shipped the exception to the Single Sentence Record Limit for languages with fewer than 500k speakers globally. Here are the release notes.

mkohler · July 22, 2020, 10:00pm

Does that imply that the new content at voice.mozilla.org won’t be translated? Or are you gonna use a different locale scheme?

Can you elaborate on that?

Christos · July 24, 2020, 2:27pm

The redirection preserving the path will be there until we serve a full site under voice.mozilla.org. Regarding the localization of that website, we don’t have a clear timeline yet. We will certainly deliver an English only version first, and then we are going to work on possible localization.

I am referring to MDN, the Mozilla Developer Network website. I just updated the sentence and added a link too. Thanks

lissyx · July 24, 2020, 2:53pm

If it’s made localizable and exposed on Pontoon, it’ll be done.

ComputArte · July 31, 2020, 3:54pm

Hi Christos,
I have been trying to contact you since the 17th of July ’20, but without success. (I wrote to you twice at: replies+c517fc62e93826b83e0ba76bc6a270bf@discourse.mozilla.org)
Can you drop me a line at cybertec@computarte.it? Thank you, Vasco

Christos · August 4, 2020, 10:27am

Thank you, Vasco. I am reaching out now.