✅ June Validation Campaign: Enhance the upcoming dataset release!

nukeador · June 7, 2020, 10:08pm

Other languages: Español - Deutsch - Français

Hello everyone,

I’m happy to share that we are preparing everything for our next Common Voice dataset release and we want to make sure we can include as many validated hours as possible to increase its quality and usefulness.

Our goal is to release the latest data on approximately June 30th, 2020. The release of a new dataset requires some preparation and the Common Voice team is planning to initiate compilation of the latest data on June 22nd, 2020. This is considered the cut-off date for recorded and validated data to be included with the next dataset release.

Most languages have a significant number of recorded hours still waiting to be validated. We encourage everyone to focus their energy, and that of their communities, on validating as much as possible before June 22nd. This will allow these hours to be released in the latest version of the dataset.

This will also give researchers and people training speech recognition models more data to train initial models in your languages. It will also help attract more people to contribute to the project.

How can you help?

If you are already contributing to Common Voice, focus your time on listening and set up a personal goal on your profile as a reminder.

Please read and share the following community guidelines to learn how to validate voices better.

Talk with your community and explain why having as many validated hours as possible by the end of June is important. Tell them how to create a profile on the site. Set up a personal goal and review the validation guidelines (you might want to localize this topic and the guidelines, then publish them on your language's Discourse).

Encourage fun activities to get people validating for a few minutes every day, and make some noise in your community and on social networks.

Thanks everyone for your contributions!

nukeador · June 3, 2020, 5:21pm

stergro · June 4, 2020, 7:30am

Thanks for being more transparent about the timeline of the preparation of the dataset this time. I made a German translation of this post:

hellosct1 · June 7, 2020, 7:00pm

Hi, I made a French translation of this post:

nukeador · June 15, 2020, 1:13pm

Reminder: We are one week away from June 22nd, the deadline to get as many voices validated as possible so they can be part of the next dataset release.

Please make sure you let your communities know and push for validation this week!

Thanks so much for your contributions!