Common Voice Dataset Release - Mid Year 2020

mbranson · August 4, 2020, 11:52pm

Read this post in other languages: Español

More data, more languages, and introducing our first target segment!

We are halfway through 2020, and already it’s been an exciting year for Common Voice! Thanks to the enthusiasm and incredible engagement from our Common Voice communities, we are releasing an updated dataset with 7,226 total hours of contributed voice data. 5,671 of these hours have been confirmed valid by our diligent contributors. Dataset fun fact: this release comprises over 5.5 million clips*!

Not only is Common Voice growing, it’s continuing to diversify. This release includes voice recordings in 54 languages; 14 of these languages** are new to the platform and dataset. The platform is seeing more languages with over 5,000 unique speakers*** and an increase in languages with over 500 recorded hours****. With contributions from all over the globe, you are helping us follow through on our goal to create a voice dataset that is publicly available to anyone and represents the world we live in.

We are also proud to announce the release of our first ever dataset target segment! In May, Common Voice started collecting voice data for a specific purpose or use case. Now, we’re releasing the single word target segment which includes the digits zero through nine, as well as the words yes, no, hey and Firefox. The released target segment is 120 total recorded hours, with 64 valid hours, across 18 languages. It was created in one month by over 11,000 unique contributor voices! This segment data will help Mozilla benchmark the accuracy of our open source voice recognition engine, Deep Speech, in multiple languages for a similar task and will enable more detailed feedback on how to continue improving the dataset.

From the whole Voice team at Mozilla: Thank you for your ongoing contributions, your support and your enthusiasm! Going into the second half of 2020, we look forward to continuing our mission to build a better, more open, internet.

Cheers,

Megan + the Common Voice team

*Average clip duration is 4.7 seconds.

**14 new languages included with this release: Upper Sorbian, Romanian, Frisian, Czech, Greek, Romansh Vallader, Polish, Assamese, Ukrainian, Maltese, Georgian, Punjabi, Odia, and Vietnamese.

***Languages with over 5,000 unique speakers: English, German, French, Italian, Spanish

****Languages with over 500 recorded hours: English, German, French, Kabyle, Catalan, Spanish, Kinyarwanda

dabinat · July 1, 2020, 4:08am

Great job everyone!

@mbranson Now that English is a 50 GB download and future datasets will have even more data, will there be efforts to reduce file sizes in future? This could include splitting it up into separate downloads (validated, rejected, unvalidated) or using a more efficient codec like Opus.

Daoaosdj · July 1, 2020, 4:13am

Adding alternative download options, for example .torrent, would be helpful as well.

mbranson · July 2, 2020, 12:04am

Thanks both for the input – agreed, 50 GB is quite large and difficult to parse, especially at slower bandwidth. We’re working on enabling multiple smaller file downloads for each language, though we didn’t want that effort to delay making the data available. Keep an eye out for this sooner rather than later.

Also note that we’re exploring ways to improve access to the dataset overall and will be prototyping (at least in the tech stack to start) how we can move away from larger releases to smaller, more continuous ones. It’s our long term goal to make the dataset more self-serve and accessible no matter where you are. This is a key theme of work for the team as we jump into the second half of 2020, and we are just starting to scope it. Stay tuned.

mbranson · July 2, 2020, 12:07am

phirework · July 14, 2020, 10:09pm

Hey everyone! Just wanted to flag that we released a version 5.1 of this dataset, as it came to our attention that 5.0 unintentionally altered the column order of the test/train/dev TSV files and included some redundant metadata entries for clips that didn’t actually have valid audio.

As with 5.0, Corpus 5.1 contains all clips recorded on or before 23:59:59 UTC June 22nd, 2020. Get the latest updated versions here: https://voice.mozilla.org/datasets

Tortoise · July 15, 2020, 1:14pm

Dear @phirework (Jenny) Lead Engineer and dear @mbranson (Megan) Design Lead.

Thank you so much for the kind info.
The train, dev and test TSV files still need some attention.
Partway through the files, the format is handled inconsistently, which makes it difficult to preprocess the data. In short, the column separator is not consistent throughout: in some places TAB is the separator, in others a space or a comma. Examples in the German dataset: train rows 986, 1976, 2431, 3141, 15925; dev rows 347, 550, 1179, 2486, 2827, 3415, 5605, 6346, 7060 and the end of that file; and test rows after 1829, 2517, 3040, 3613, 7106, 7694, 10273. This is a serious problem that I don’t know how to handle. Maybe you have an automated script and can re-export a cleaned version from the database. The error may have been introduced while merging two or three datasets with different CSV formats. Please take another look. I still have to check the files later to confirm that all of them actually contain data.

phirework · July 15, 2020, 7:59pm

Hi Tortoise, all of the datasets are generated from scratch from a single source, there is no merging happening, and they were never anything other than tsv. Can you confirm that for dev row 347 you’re talking about the one with common_voice_de_19722576.mp3, and dev row 550 is the one with common_voice_de_22036339.mp3? Both of those rows have 8 tab delimiters in the file string (you can grep for the \t character) as expected, so I’m not entirely sure what you’re seeing. Can you provide more detail?
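For anyone who would rather script the check than grep for it, here is a minimal Python sketch of the same idea: count the tab characters on each line and list the rows that deviate. The function names and the in-memory sample are hypothetical, and the default of 8 tabs is simply the figure quoted above, so adjust it to whatever your copy of the file actually contains.

```python
from typing import Iterable, List, Tuple


def find_irregular_rows(lines: Iterable[str], expected_tabs: int = 8) -> List[Tuple[int, int]]:
    """Return (row_number, tab_count) for every line whose tab count differs.

    Row numbers are 1-based and count the header line as row 1.
    """
    irregular = []
    for row_number, line in enumerate(lines, start=1):
        # Strip only the line ending so trailing empty fields still count.
        tab_count = line.rstrip("\r\n").count("\t")
        if tab_count != expected_tabs:
            irregular.append((row_number, tab_count))
    return irregular


def check_file(path: str, expected_tabs: int = 8) -> List[Tuple[int, int]]:
    """Run the same check on a TSV file on disk (path is supplied by the caller)."""
    with open(path, encoding="utf-8", newline="") as handle:
        return find_irregular_rows(handle, expected_tabs)


# Tiny illustrative sample: three-column rows, so two tabs are expected.
sample = ["a\tb\tc\n", "a,b c\n", "x\ty\tz\n"]
print(find_irregular_rows(sample, expected_tabs=2))  # prints [(2, 0)]
```

An empty result means every line carries the expected number of tabs, which would point to the parser rather than the file.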

Tortoise · July 15, 2020, 8:49pm

@phire Specifically, the rows I mentioned cannot be parsed with only \t as the separator. They appear to be laid out differently, and Python readers are throwing errors. If you check all the TSV files at your end, you will see it. These lines contain data (several rows and columns packed into a single cell) with different separators. In some cases a comma works, but not in all.

The TSV layout is not consistent throughout the train, dev and test files. That is what I want to tell you. For this reason, when I try to preprocess the data, the CSV readers raise an index-out-of-range error.

This was not the case with previous dataset tsv files.
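One possible explanation worth ruling out, and it is only an assumption because the thread does not confirm the cause: Python's `csv` module treats a double quote as a quoting character by default, so a sentence that begins with a quotation mark can absorb the tabs and line breaks that follow it. That would match the description of several rows and columns ending up in a single cell. The sketch below reads tab-separated text with quoting disabled; `read_tsv_rows` and the sample rows are hypothetical.

```python
import csv
import io
from typing import List, TextIO


def read_tsv_rows(handle: TextIO) -> List[List[str]]:
    """Read tab-separated rows, treating quote characters as ordinary text."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader]


# Illustrative data only: the first sentence starts with an unmatched quote.
sample = 'id1\tclip1.mp3\t"Hallo\nid2\tclip2.mp3\tWelt\n'
rows = read_tsv_rows(io.StringIO(sample))

for row in rows:
    # Each row should split into exactly three fields.
    print(len(row), row)
```

If rows that previously failed now split into the expected number of fields, the quote handling was the culprit rather than the separators in the file.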

phirework · July 15, 2020, 9:15pm

Great, thanks for that info. Can you tell me what python library you’re using to parse?

Tortoise · July 15, 2020, 10:15pm

@phire I am using the audiomate library. The point is that this issue has never occurred before.
I compared the previous TSV files with those in the 5.0/5.1 dataset. The problem persists, and I am sure it is due to this inconsistency in the column separator. I don’t know how to handle multiple separators: allowing a comma or a space as a separator, which is what those parts would need, breaks the rest of the data, because the text itself contains them. I know that audiomate only needs the first three columns: ID, path and text. I tried rearranging the columns to match the previous version, but the affected rows are still laid out differently.

phirework · July 15, 2020, 11:30pm

Thanks! I appreciate your point that this is a new issue introduced in this dataset, I’m just trying to replicate the error conditions so I can investigate further.

Tortoise · July 15, 2020, 11:53pm

@phirework I have just downloaded the dataset for the fourth time (the third time today). Now the TSV is completely changed: the column order is different and the metadata is different as well. It looks like the archive has been replaced, so maybe this is the correct version. I can confirm whether everything is fine by tomorrow afternoon, in 12 to 24 hours.

phirework · July 16, 2020, 12:03am

Yes, the 5.0 release had a problem where the test/dev/train files had incorrect column orders, so that the columns were sorted alphabetically as “accent, age…” etc. instead of the old standard “client_id, path, sentence…” that we’ve been releasing. Corpus 5.1, which went out yesterday afternoon, fixes that bug. Let me know if you get different results from this one.
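Re-downloading 5.1 is the intended fix. For anyone who still has a 5.0 copy and wants to see what the reordering amounts to, here is a rough Python sketch that moves the three columns named above to the front and keeps the rest in their existing order. The function name, the sample header and the sample row are made up for illustration, and the full column list is not assumed.

```python
from typing import List, Sequence, Tuple


def reorder_columns(
    header: Sequence[str],
    rows: Sequence[Sequence[str]],
    leading: Sequence[str] = ("client_id", "path", "sentence"),
) -> Tuple[List[str], List[List[str]]]:
    """Move the `leading` columns to the front; keep the others in place."""
    missing = [name for name in leading if name not in header]
    if missing:
        raise ValueError(f"columns not found in header: {missing}")
    front = [header.index(name) for name in leading]
    rest = [i for i in range(len(header)) if i not in front]
    order = front + rest
    new_header = [header[i] for i in order]
    new_rows = [[row[i] for i in order] for row in rows]
    return new_header, new_rows


# Illustrative header in alphabetical order, limited to names quoted in the thread.
header = ["accent", "age", "client_id", "path", "sentence"]
rows = [["", "twenties", "abc", "clip.mp3", "Hallo Welt"]]

new_header, new_rows = reorder_columns(header, rows)
print(new_header)
print(new_rows[0])
```

A tool that reads columns by position, as described earlier in the thread, would then find the ID, path and text where it expects them.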

Tortoise · July 16, 2020, 11:57am

@phirework Thank you so much for promptly replacing the dataset with the correct version. It seems to work so far. I hope things will be fine. If I notice anything else, I will write back to you.
Thank you so much for your kind help.

Stay healthy.

mbranson · August 4, 2020, 11:54pm

The valid hours noted in this post have been updated to reflect the 5.1 release update.

Tortoise · August 21, 2020, 1:16pm

@mbranson @phirework Thank you. I have just noticed your comment above. Does this mean I have to re-download the dataset if I downloaded it in July, or is everything the same?