Common Voice Quarter 1 2020 Community Update

2020 has already been a big year for Common Voice. Entering the second quarter of the year, the Common Voice platform is equipped with exciting new features that set us up for success as the project moves forward.

Some of the features we have already implemented, as well as those planned for the near future, are the result of a broader strategy that allows us to get immediate and direct feedback from you, the community. This strategy also helps us direct our work to where it’s needed most, whether that be long-term visions or short-term fixes.

In Q1, the Common Voice team has been busy strengthening our infrastructure, internal tooling, and the health and growth of our developer community. Let’s dive into the details.

Releases, Infrastructure, and Tooling Improvements

4200h Voice Dataset Release

We created Common Voice to build an open voice dataset that is publicly available to everyone. As a result of our community’s hard work and amazing engagement, we released an additional 4,200 hours of Common Voice data at the beginning of the year. We are currently working on another dataset release and expect it to be available around late June or early July.

Moving to Kubernetes

In an earlier update, we shared our plan to radically improve the infrastructure of Common Voice. Our engineers have been working to migrate our deployment infrastructure to Kubernetes, an industry leader in deployment management that is used by many other teams inside Mozilla. This process was completed in mid-April, and you should be seeing improvements in site stability and performance.

Enabling recording for Common Voice via mobile Safari

Supporting the open web is our #1 priority. With the standard MediaRecorder Web API still not supported in various Safari browsers, we didn’t want to continue directing our community and developers to an outdated app in a walled garden. Instead, we have implemented a lightweight polyfill for MediaRecorder that enables recording on Safari and allows us to decommission the iOS Common Voice app. Starting in late April, we have been redirecting visitors, contributors, and developers interested in the Common Voice iOS app to our mobile website. This way, we encourage and unlock further contributions to Common Voice while focusing all our development efforts on the web platform, improving the web experience for everyone. You can read the detailed announcement here.

Tablet Demo Mode

Over the years, Common Voice has been presented at various conferences and events to showcase the platform and encourage participation. This created the need for a customized experience that helps the team and our community be more effective while demonstrating Common Voice. We are happy to announce that a guided Tablet Demo Mode will be implemented as part of the Google Summer of Code program. We are thrilled to have so many talented applicants, and we are looking forward to working with our selected student over the summer.

Improving our Dataset’s Quality

Last year, we spent a lot of time talking to you about our dataset’s quality. After taking this feedback to our internal team and conducting further research with industry experts, we established a set of quality criteria for high-quality STT (speech-to-text) datasets. These criteria have helped us identify the biggest gaps between high-quality STT datasets and the Common Voice dataset as it is today.

This year, we want to bridge those gaps by introducing a set of updates to the Common Voice platform and our dataset.

An example of our commitment to improving our dataset’s quality is our current effort to decrease data repetition. Currently, Common Voice allows people to record the same sentence over and over again once there are no new sentences. Our Deep Speech colleagues use just one recording per sentence, and the resulting models achieve a lower word error rate than they would with many repetitions. Research with additional machine learning experts has further confirmed this. As a result, we have decided to limit our validated recordings to one per sentence for languages where we expect to have enough validated hours to train an STT model.
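To make the "one validated recording per sentence" idea concrete, here is a minimal sketch in Python. It is only an illustration and not how the Common Voice platform implements the limit: the `Clip` record, its field names, and the choice to keep the first validated clip seen for each sentence are all assumptions made for this example.

```python
from typing import Iterable, List, NamedTuple


class Clip(NamedTuple):
    """Hypothetical record for one recording (field names are assumptions)."""
    path: str        # location of the audio file
    sentence: str    # text the speaker was asked to read
    validated: bool  # whether the clip passed community validation


def one_clip_per_sentence(clips: Iterable[Clip]) -> List[Clip]:
    """Keep at most one validated clip for each distinct sentence.

    Assumption: the first validated clip encountered for a sentence wins.
    """
    seen_sentences = set()
    kept = []
    for clip in clips:
        if not clip.validated:
            continue  # unvalidated clips never count toward the dataset
        if clip.sentence in seen_sentences:
            continue  # a validated recording of this sentence already exists
        seen_sentences.add(clip.sentence)
        kept.append(clip)
    return kept
```

For example, given two validated recordings of "Hello world" and one validated recording of "Good morning", the function would keep two clips in total, one for each sentence.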

Additionally, an important characteristic of a good dataset is consistent and accurate annotation of the audio. This means that a recorded clip should only be considered valid if the audio matches its associated text exactly. We have received feedback from experts that Common Voice audio annotation is not considered to be of high quality… yet, and we have also heard requests from the community for clearer guidelines on annotation. This is a larger theme of work our team is committed to, and efforts will kick off as we move into the second quarter of the year.

Expect a detailed post about these feature releases as they roll out in the near future.

Release Notes come to Common Voice

Since last February, our amazing engineers have been including detailed notes with every Common Voice release. In these notes, anyone can find information about major changes, new features, improvements, and bug fixes. You can find all previous notes, as well as those for future releases, on the relevant page on GitHub.

Community

Moving to Matrix

At the beginning of the year, we moved from our Slack instance to the new Mozilla Matrix instance in order to better serve and grow our community. Community participation guidelines are now enforced through its moderation tooling, creating a safe environment for communication. These changes also bring more visibility to Common Voice among the other volunteer communities present on the Mozilla Matrix instance, and vice versa. You can join the Common Voice community by clicking here and you can read more about it here.

Streamlining GitHub

We are thrilled to have so many contributors reporting issues or opening pull requests on our GitHub repository. However, while the volume of contributions has increased significantly over the past months, our ability to triage and prioritize code reviews has decreased. To resolve this, we are introducing a new labeling system and onboarding an additional team member to help triage both pull requests and GitHub issues. This will help us be more efficient and responsive. Later this month, we will share a detailed approach to supporting our code contributors. Our goal is to improve the project’s health and ensure that community development efforts are well supported to help the project grow.

Community Product Roadmap

Common Voice is a crowdsourcing platform, and the community is playing an important role in the development, health, and growth of the project. You have all been providing valuable feedback towards identifying the right direction of the platform and achieving our goals. Many of the features we are exploring, or plan to implement, are community ideas. In an effort to bring more transparency and give back to the community, we are publishing a roadmap with all the important features that we have prioritized and will be implementing. This roadmap is a living document and will be updated with more details and new features as we move forward.


Community Product Roadmap live-view

Last but not least, we would like to take this opportunity to give a shoutout to some of the community members that have been instrumental in helping Common Voice succeed.

  • Jindřich Dítě, for their work writing the Czech Wikipedia extraction rules, code contributions to the sentence collector, and helping other people on GitHub and Discourse.
  • @stergro:mozilla.org, for their work on the sentence extractor rules and feedback, as well as supporting others on Discourse.
  • @fjoerfoks:mozilla.org, for their work on the sentence extractor rules and feedback.
  • @txopi, for their work mobilizing the Basque community and coming up with the rules for the Wikipedia extraction.
  • @mkohler, for his continuous stewardship of the sentence collector and extractor. The progress and status of these tools are primarily the result of Michael’s contributions.

To everyone who has been contributing to Common Voice, thank you! Community is a vital part of Common Voice and Mozilla. We would like to thank you all for your ongoing contributions, your support and creativity, your thoughtfulness, and your patience!

Hey, thanks for the update, very interesting and good-looking news :)

Since you mentioned the Matrix channel, there are also a few language specific channels:

This issue really comes up regularly. One thing that could be done quickly is adding a section about this to the FAQs. There is no information about this yet, and a few basic rules (like “Don’t speak the punctuation marks out loud”) would be really useful there.

Do I understand correctly that if I want to add my language, it is better to add a lot of unique sentences? For example, we have calculated that 10,000 hours is 6,000,000 sentences.
Is there any explanation why this is the case?
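
For reference, the figure quoted in this question (10,000 hours from 6,000,000 sentences) works out to about six seconds per recording, assuming each sentence is recorded exactly once. A minimal sketch of that arithmetic follows; the function name is purely illustrative.

```python
def average_clip_seconds(total_hours: float, sentence_count: int) -> float:
    """Average clip length in seconds if every sentence is recorded once."""
    seconds_per_hour = 3600
    return total_hours * seconds_per_hour / sentence_count


# Figures taken from the question above.
clip_length = average_clip_seconds(10_000, 6_000_000)
print(f"{clip_length:.1f} seconds per sentence")
```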

With a model such as a recurrent neural network, if you give it data with repeated sentences, it will likely memorize the whole sentence rather than paying attention to each word and sound. This is a type of overfitting, since it will not perform as well in the future on new sentences that it hasn’t memorized.

Thank you to the Common Voice team and volunteers! It is great to see the project maturing.

This is a pretty clear explanation. Thanks.