Hello Common Voice Community,
We are excited to let you know about the roadmap we have planned for the first half (H1) of 2019 based on the conversations we had during the Berlin staff meetup that took place in February.
- Partnerships: The Common Voice team is looking to optimize partnerships that work towards both sentence collection and voice collection for the dataset.
- Website/app: The website is an ever-evolving collection tool for Common Voice, and we are working on optimizations to ensure both ease of use and a sense of community for the user.
- Community: Design and test a community engagement model in Mandarin and English that can scale to other languages, collecting 300 additional hours of voice data and informing a self-serve community strategy.
- Mandarin: We are focused on collecting what we are calling a "Minimum Viable Dataset" for training Deep Speech in Mandarin (2000 hours of voice data), for inclusion in Mozilla products in the second half of 2019. We need to use a variety of methods, each of which is unlikely to work on its own but which together can produce a viable dataset: existing data from partners, the Common Voice app, and testing out some Mechanical Turk-style paid crowdsourcing.
- Voice strategy: We are working with various parts of Mozilla to understand the voice products and underlying technologies ecosystem, and produce recommendations on what that means for the directions Mozilla should pursue.
We have been looking at the numbers and taking in your suggestions to create new and better features for the website. Some of the exciting developments you should expect to see this quarter are personal goal setting and connecting with friends. These features will allow us to continually grow the Common Voice corpora and instill a sense of community in the app. Everyone is donating their voice to help support an open corpus, and you should be able to see friends who are also contributing. We will also be A/B testing three different features to see how we can best optimize the website for all of our users. The goals of these optimizations include better guidance for first-time users and understanding how traffic moves through the Common Voice website, so that we can increase language collection.
We have been working really closely with the Deep Speech team to understand the minimum characteristics a dataset must meet to be useful for them. Based on their calculations we will need a minimum of 1000 speakers and 2000 spoken and verified hours.
Focus and strategy
We realized that in order to learn the best way to gather a minimum dataset, we need to complete this journey for at least a couple of languages. That’s why in the coming months we will focus staff time on English and Mandarin, and we are working toward trying alternative collection techniques in both of these languages.
We are starting this process by engaging in new partnerships and looking at possible events and collection hubs. The team knows how important it is to have a plan and to be able to prove that plan, so that we can better direct others in language collection. Focusing on two very different languages and trying new techniques will show us the best ways to engage and sustain languages.
While the team is focusing on those two languages, we are excited to see what the community comes up with and what we can learn from community-focused languages. Every language is important, and we need the community's help to push them forward.
We have a lot to do in the first half of this year, and the team is very excited to continue work on this project. Stay tuned for the experiments and A/B testing, and let us know your thoughts!
-The Common Voice Team