Common Voice Roadmap Update

r_LsdZVv67VKuK6fuHZ_tFpg · March 26, 2019, 9:43pm

Hello Common Voice Community,

We are excited to let you know about the roadmap we have planned for the first half (H1) of 2019 based on the conversations we had during the Berlin staff meetup that took place in February.

Partnerships: The Common Voice team is looking to optimize partnerships that work towards both sentence collection and voice collection for the dataset.
Website/app: The website is an ever-evolving collection tool for Common Voice, and we are working on optimizations to ensure both ease of use and a sense of community for the user.
Community: Design and test a community engagement model in Mandarin and English that can scale to other languages, collecting 300 additional hours of voice data and informing a self-serve community strategy.
Mandarin: We are focused on collecting what we are calling a "Minimum Viable Dataset" for training Deep Speech in Mandarin (2000 hours of voice data), for inclusion in Mozilla products in the second half of 2019. We need to use a variety of methods, each of which is unlikely to work on its own but which can combine to produce a viable dataset: existing data from partners, the Common Voice app, and testing out some Mechanical Turk-style paid crowdsourcing.
Voice strategy: We are working with various parts of Mozilla to understand the voice products and underlying technologies ecosystem, and produce recommendations on what that means for the directions Mozilla should pursue.

Website

We have been looking at the numbers and taking in your suggestions to create new and better features for the website. Some of the exciting developments you should expect to see this quarter are personal goal setting and connecting with friends. These features will allow us to continually grow the Common Voice corpora and instill a sense of community in the app. Everyone is donating their voice to help support an open corpus, and you should be able to see friends who are also contributing. We will also be A/B testing three different features to see how we can best optimize the website for all of our users. The goals of these optimizations include better guidance for first-time users and understanding how traffic moves through the Common Voice website, so that we can increase language collection.

We have been working really closely with the Deep Speech team to understand the minimum characteristics a dataset must meet to be useful for them. Based on their calculations we will need a minimum of 1000 speakers and 2000 spoken and verified hours.
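As a rough illustration of those two thresholds, here is a minimal sketch of a progress check for one language. The `DatasetStats` class and the function names are hypothetical and are not part of any Common Voice tooling; only the 1000-speaker and 2000-hour figures come from the post.

```python
from dataclasses import dataclass

# Thresholds quoted in the post above.
MIN_SPEAKERS = 1000
MIN_VERIFIED_HOURS = 2000


@dataclass
class DatasetStats:
    """Hypothetical summary of the data collected so far for one language."""
    speakers: int
    verified_hours: float


def remaining_to_minimum(stats: DatasetStats) -> dict:
    """Return how many speakers and verified hours are still missing."""
    return {
        "speakers": max(0, MIN_SPEAKERS - stats.speakers),
        "verified_hours": max(0.0, MIN_VERIFIED_HOURS - stats.verified_hours),
    }


def meets_minimum(stats: DatasetStats) -> bool:
    """True only when both thresholds are reached."""
    gap = remaining_to_minimum(stats)
    return gap["speakers"] == 0 and gap["verified_hours"] == 0
```

Both conditions have to hold at once: plenty of hours from too few speakers would not meet the minimum, and neither would many speakers with too few verified hours.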

Focus and strategy

We realized that in order to learn the best way to gather a minimum dataset, we need to complete this journey for at least a couple of languages. That’s why in the coming months we will focus staff time on English and Mandarin, and we are working toward trying alternative collection techniques in both of these languages.

We are starting this process by engaging in new partnerships and looking at possible events and collection hubs. The team knows how important it is to have a plan and to be able to prove that plan, so we can better direct others in language collection. Focusing on two very different languages and trying new techniques will show us the best ways to engage and sustain languages.

While the team is focusing on those two languages, we are excited to see what the community comes up with and what we can learn from community-focused languages. Every language is important, and we need the community’s help to push them forward.

We have a lot to do in the first half of this year and the team is very excited to continue work on this project. Stay tuned for the experiments and A/B testing and let us know your thoughts!

-The Common Voice Team

dabinat · March 14, 2019, 10:31pm

I hope that any community engagement / sprints also include Sentence Collector. Since the DeepSpeech model only uses one recording per sentence, it would be wasteful to have lots of users come to the site only to record duplicate sentences that won’t be used by DeepSpeech. So IMO when sprinting, there needs to be an effort to make sure the number of sentences can keep up with the number of speakers.
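To put a rough number on that, here is a back-of-the-envelope sketch. It takes the one-recording-per-sentence point above at face value and assumes an average clip length, which is a made-up figure and not a measured Common Voice statistic; the function name is hypothetical.

```python
import math


def unique_sentences_needed(target_hours: float, avg_clip_seconds: float) -> int:
    """Estimate how many distinct sentences are needed so that every
    recorded clip is a first recording of its sentence.

    avg_clip_seconds is an assumption supplied by the caller.
    """
    if avg_clip_seconds <= 0:
        raise ValueError("avg_clip_seconds must be positive")
    return math.ceil(target_hours * 3600 / avg_clip_seconds)


# The 300 additional hours from the roadmap, with an assumed 5-second clip:
print(unique_sentences_needed(300, 5))  # 216000
```

Under that assumed clip length, the 300-hour goal alone would need a couple of hundred thousand fresh sentences, which is why sentence supply matters as much as speaker count.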

Michael_Maggs · March 15, 2019, 11:35am

To follow up on @dabinat’s comment, running a community event in English that includes recording is definitely not feasible without significant improvements to the Sentence Collector. At the moment, validation of uploaded sentences is so slow and difficult that very few people are attempting it. Those who do are getting burnt out by the sheer tediousness of the process, not to mention the physical wrist pains that develop from moving the mouse backwards and forwards across the screen hour after hour.

The Sentence Collector is a real bottleneck. I could very quickly provide several tens of thousands of English sentences extracted from public domain sources, but at the moment there is little point in doing so as they would take many months to get validated - by which time, even without a community event, existing sentences will have been read many times.

You’re going to need some sort of mechanism to validate uploaded sentences much more quickly: either an improvement to the Sentence Collector, or an agreement to bypass it. I know that you want to make sure that everything goes through the Collector, but you could perhaps accept some big direct uploads, make a note of the sentences involved, and trickle them through the Collector as recording is going on. Any that fail Collector validation would then be deleted whether or not they had already been recorded.
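A minimal sketch of that bypass-then-trickle idea, to make the bookkeeping concrete. Everything here (the `ProvisionalPool` class, its methods, the clip ids) is hypothetical and does not reflect how the Sentence Collector or the Common Voice site actually stores data.

```python
from dataclasses import dataclass, field


@dataclass
class ProvisionalPool:
    """Hypothetical store for directly uploaded sentences awaiting review."""
    pending: set = field(default_factory=set)       # uploaded, not yet reviewed
    approved: set = field(default_factory=set)      # passed Collector review
    recordings: dict = field(default_factory=dict)  # sentence -> list of clip ids

    def bulk_upload(self, sentences):
        """Accept a big direct upload and note which sentences are provisional."""
        for sentence in sentences:
            if sentence not in self.approved:
                self.pending.add(sentence)

    def record(self, sentence, clip_id):
        """Recording may proceed while the sentence is still pending."""
        if sentence not in self.pending and sentence not in self.approved:
            raise KeyError(f"unknown sentence: {sentence!r}")
        self.recordings.setdefault(sentence, []).append(clip_id)

    def review(self, sentence, passed):
        """Apply a review result; return the clip ids that must be deleted."""
        self.pending.remove(sentence)  # raises KeyError if it was not pending
        if passed:
            self.approved.add(sentence)
            return []
        # Failed review: drop the sentence and any clips already recorded.
        return self.recordings.pop(sentence, [])
```

The cost of this approach is visible in `review`: a rejected sentence takes its recordings with it, so some volunteer effort is wasted in exchange for not blocking recording on review.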

r_LsdZVv67VKuK6fuHZ_tFpg · March 19, 2019, 11:15pm

Hello @Michael_Maggs and @dabinat, we are going to be looking into what changes and optimizations the Sentence Collector needs, and including the community in that. These optimizations range from finding a better way to enter and accept sentences to ensuring that people know where and how to enter those sentences. We’ll have more information about this over the next couple of weeks as we scope out what the team can work on.

jf99 · March 23, 2019, 3:32pm

Before we optimize the Sentence Collector for speed and advertise it on the main page of CV, we need to ensure a good quality of the collected sentences first (imo).

Therefore a dialog between writers and validators is indispensable. Writers must know why their contributions are downvoted. Otherwise they will not learn to avoid their spelling/grammar mistakes in the future.

nukeador · March 25, 2019, 2:45pm

Hello everyone,

I’ve been out for a few weeks, but now I’m back and working with the team to explore all the options we have to enable communities to collect sentences in a more agile way. It’s clear from all your feedback that this is something we should be putting our focus on in order to empower you and your communities.

Expect some ideas back for feedback in the coming weeks.

Thanks!

alfem · September 3, 2019, 8:04am

What about parsing public domain books? They probably use old syntax, but there are a lot of sentences in them.

This site has got Spanish, French and Italian books: http://www.dominiopublico.es/

nukeador · September 3, 2019, 11:22am

Yes, once we improve our Wikipedia extractor script and automate its process, we want to start exploring how to use what we learn to build something that can parse large public domain corpora and split them into sentences that follow our established rules for each language.
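As a very rough idea of what "split into sentences and keep the ones that follow the rules" could look like, here is a small sketch. It is not the Wikipedia extractor script; the splitting regex is naive, and the rules shown (a word limit of 14 and no digits) are placeholder assumptions standing in for whatever each language's real rules are.

```python
import re

# Naive boundary: whitespace that follows '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split raw text into candidate sentences."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def follows_rules(sentence, max_words=14):
    """Placeholder rules: no digits, and between 1 and max_words words."""
    if any(ch.isdigit() for ch in sentence):
        return False
    return 1 <= len(sentence.split()) <= max_words


def extract(text, max_words=14):
    """Return unique sentences that pass the rules, in original order."""
    seen = set()
    kept = []
    for sentence in split_sentences(text):
        if follows_rules(sentence, max_words) and sentence not in seen:
            seen.add(sentence)
            kept.append(sentence)
    return kept
```

A real splitter would need per-language handling of abbreviations, quotations and older punctuation conventions, which is where the "old syntax" concern raised above would show up.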