Common Voice Project Update - December 27th 2019

2019 is quickly coming to a close, and we are analyzing data to see what worked in 2019 and what we can do better in 2020. We are deciding the immediate and future direction of the product based on technical needs, community requests and dataset quality. Thank you to everyone who has been a part of the community in 2019. We look forward to working with you next year!

Community

Campaigns

In H2 we were able to collect hundreds of hours of data through campaigns run by the community team. They were able to test different methods and pull the right levers to ensure contribution across many languages. Below, you can see the number of hours validated in the second half of 2019.

Language and Accent strategy

The language and accent strategy is currently with the Mozilla Legal team, and work on it is expected to start in February 2020.

App Update

Roadmap

The app team is meeting in the first week of January to solidify and scope what the first half of 2020 will look like and we will be able to share a roadmap shortly.

In early 2020 we will be working on dataset optimization and on defining the parameters of a quality dataset. This requires us to work with the machine learning team to ensure that we are collecting data in a way that will be useful for everyone who needs it.

Partner Challenge

For those who have been following along with the Open Voice Data Challenge Pilot, we now have some results and are deciding how to move forward once the app infrastructure work is finished.

Below you can see two of the metrics we looked at: week-over-week engagement and level of contribution. Overall the challenge was a success, with a few tweaks needed before we release it to a wider audience.

  • Contributors who are part of a challenge are much more likely to come back to the Common Voice site and contribute week over week.

  • Contributors who were part of a challenge were also more likely to be classified as core contributors. Currently, 2% of Common Voice contributors speak or listen to 250 or more clips. In the challenge, we found that the number jumped to 29% for those involved in the pilot program.

Hey, thanks for the update. A few months ago you released the list of reported sentences for every language. Will you do this again?

Hi Stergo, are you speaking about the Dataset release? If yes, we plan to have another release sometime in January.

Hey Isaunders, no I meant the sentences with errors that were reported by users. A few months ago someone published the list.

Great news though :slight_smile:

Hi @r_LsdZVv67VKuK6fuHZ_tFpg ,

Thank you for all the information. We could also add an extra section listing upcoming talks or workshops whose dates we already know:


Upcoming events:

  • January 17th: Hellosct1 & Alexandre L. - Workshop “Common Voice” at the OPEN OFFICE in Paris, France
  • February 27th: Hellosct1 - Talk “Give the machines a voice” in Montreal, Canada

Christophe

Thanks for the info @hellosct1

My understanding is that we currently don’t have an easy place to find all the events happening around Common Voice (only the Reps site).

Ideally this will change once the #community-portal is launched and anyone can create their events there, so we will be able to list them all easily.
