Common Voice Dataset Release - Mid Year 2020

Read this post in other languages: Español

More data, more languages, and introducing our first target segment!

We are halfway through 2020, and already it’s been an exciting year for Common Voice! Thanks to the enthusiasm and incredible engagement from our Common Voice communities, we are releasing an updated dataset with 7,226 total hours of contributed voice data. 5,591 of these hours have been confirmed valid by our diligent contributors. Dataset fun fact: this release comprises over 5.5 million clips*!

Not only is Common Voice growing, it’s continuing to diversify. This release includes voice recordings in 54 languages, 14 of which** are new to the platform and dataset. The platform is seeing more languages with over 5,000 unique speakers*** and an increase in languages with over 500 recorded hours****. With contributions from all over the globe, you are helping us follow through on our goal to create a voice dataset that is publicly available to anyone and represents the world we live in.

We are also proud to announce the release of our first-ever dataset target segment! In May, Common Voice started collecting voice data for a specific purpose or use case. Now, we’re releasing the single-word target segment, which includes the digits zero through nine, as well as the words yes, no, hey and Firefox. The released target segment is 120 total recorded hours, with 64 valid hours, across 18 languages. It was created in one month by over 11,000 unique contributor voices! This segment data will help Mozilla benchmark the accuracy of our open source voice recognition engine, Deep Speech, in multiple languages for a similar task and will enable more detailed feedback on how to continue improving the dataset.

From the whole Voice team at Mozilla: Thank you for your ongoing contributions, your support and your enthusiasm! Going into the second half of 2020, we look forward to continuing our mission to build a better, more open internet.

Cheers,

Megan + the Common Voice team


*Average clip duration is 4.7 seconds.

**14 new languages included with this release: Upper Sorbian, Romanian, Frisian, Czech, Greek, Romansh Vallader, Polish, Assamese, Ukrainian, Maltese, Georgian, Punjabi, Odia, and Vietnamese.

***Languages with over 5,000 unique speakers: English, German, French, Italian, Spanish

****Languages with over 500 recorded hours: English, German, French, Kabyle, Catalan, Spanish, Kinyarwanda
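
As a rough sanity check on the figures in this post, the clip count and average clip duration can be compared against the reported total hours. This is only a sketch: the constant and function names are made up for illustration, and the clip count is a lower bound because the post says "over 5.5 million".

```python
# Figures quoted in the post. CLIPS is a lower bound ("over 5.5 million"),
# so the estimate should land slightly below the reported total hours.
TOTAL_HOURS = 7226
VALID_HOURS = 5591
CLIPS = 5_500_000
AVG_CLIP_SECONDS = 4.7

# Target segment figures quoted in the post.
SEGMENT_TOTAL_HOURS = 120
SEGMENT_VALID_HOURS = 64


def hours_from_clips(clips: int, avg_seconds: float) -> float:
    """Estimate total recorded hours from a clip count and mean duration."""
    return clips * avg_seconds / 3600


def valid_share(valid_hours: float, total_hours: float) -> float:
    """Fraction of recorded hours that have been confirmed valid."""
    return valid_hours / total_hours


estimated_hours = hours_from_clips(CLIPS, AVG_CLIP_SECONDS)
full_share = valid_share(VALID_HOURS, TOTAL_HOURS)
segment_share = valid_share(SEGMENT_VALID_HOURS, SEGMENT_TOTAL_HOURS)

print(f"Estimated hours from clip count: {estimated_hours:,.0f}")
print(f"Validated share, full release:   {full_share:.1%}")
print(f"Validated share, target segment: {segment_share:.1%}")
```

The estimate comes out a little under the reported 7,226 hours, which is what a rounded-down clip count would suggest. The validated share of the target segment is lower than that of the full release, which is plausible for data collected in a single month.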


Great job everyone!

@mbranson Now that English is a 50 GB download and future datasets will have even more data, will there be efforts to reduce file sizes in future? This could include splitting it up into separate downloads (validated, rejected, unvalidated) or using a more efficient codec like Opus.


And maybe adding alternative download options, for example .torrent, would be helpful as well.


Thanks both for the input. Agreed, 50 GB is quite large and difficult to parse, especially at slower bandwidth. We’re working on enabling multiple smaller file downloads for each language, though we didn’t want that effort to delay making the data available. Keep an eye out for this sooner rather than later.

Also note that we’re exploring ways to improve access to the dataset overall and will be prototyping (at least in the tech stack to start) how we can move away from larger releases to smaller, more continuous ones. Our long-term goal is to make the dataset more self-serve and accessible no matter where you are. This is a key theme of work for the team as we jump into the second half of 2020, and we are just starting to scope it. Stay tuned.
