2020 End-of-Year Common Voice Dataset Release

phirework · December 16, 2020, 6:48pm

Happy end of 2020!

While it has been a tumultuous year for all, the Common Voice team is excited to announce the end-of-year dataset release! First, we could not have made it through this year without the dedicated and passionate Common Voice community – thank you so much to all of you for your amazing work, voice donations, clip validations, code contributions and community support.

The past six months have seen continued growth of the Common Voice dataset, with an additional 2,000 hours added, 6 more languages (Hindi, Lithuanian, Luganda, Thai, Finnish, Hungarian), and over 7 million clips in total! The top languages currently are:

English: 2,179 hours
Kinyarwanda: 1,510 hours
German: 836 hours
Catalan: 745 hours
French: 682 hours
Kabyle: 622 hours
Spanish: 579 hours
Persian: 320 hours

As we mentioned in a previous Common Voice update, the platform has been in maintenance mode for the past four months (except for you, our amazing community, who have been anything but sitting in maintenance mode!). Even so, we have still committed to regular releases. In particular, we’d like to highlight:

Past datasets are now available for download from the datasets page, along with a repo for their corresponding datasheets
Demo mode work from our Google Summer of Code project (stay tuned for more information on this soon!)
Better blank clip handling on both the front end and back end, which should drastically improve the validation experience
Better handling of historical demographics data
Ongoing UI improvements, performance optimizations, and bugfixes
Ongoing sentences and localization updates

Finally, we want to reassure everyone that Common Voice has a strong and bright future ahead. The team has been hard at work behind the scenes to ensure a stable runway for the platform, data and community; we are ramping up to make announcements early next year, so please stay tuned!

In the meantime, we wish you all a relaxing end of the year, and please keep on contributing!

jmontane · December 17, 2020, 8:45am

Great!

Just a remark:

It looks like there is an issue with the “gender” info: only “female” appears in the .tsv files. Where are “male” and “other”? Could you please fix it? Thanks

Edit:

For the Catalan language datasets, if I’ve parsed them properly:

In the CV 5.1 dataset (June 2020) we have 167,281 recordings with “female” gender metadata, and we also have recordings with “male” and “other” gender metadata.
In the CV 6.0 dataset (the latest one, Dec 2020) we have only 53,010 recordings with “female” gender metadata, and the “male” and “other” gender metadata are missing.
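
For anyone who wants to run the same kind of check on their own download, here is a minimal sketch of how the gender values in a release .tsv could be tallied. It assumes the file is tab-separated with a header row containing a column named `gender`; the function name `count_gender_values` and the sample rows are made up for illustration and are not taken from the actual release.

```python
import csv
import io
from collections import Counter


def count_gender_values(tsv_file, column="gender"):
    """Count how often each value appears in one column of a TSV file object.

    Assumption: the file has a header row that includes `column`.
    Empty or missing values are grouped under "(blank)".
    """
    # QUOTE_NONE: treat quote characters in sentences as ordinary text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        value = (row.get(column) or "").strip()
        counts[value or "(blank)"] += 1
    return counts


# Tiny made-up sample standing in for a real release file.
sample = io.StringIO(
    "path\tsentence\tgender\n"
    "a.mp3\tBon dia\tfemale\n"
    "b.mp3\tBona nit\tmale\n"
    "c.mp3\tFins aviat\t\n"
)
print(count_gender_values(sample))
```

With a real file, the same function can be called on `open(path, newline="", encoding="utf-8")`, where `path` points at whichever .tsv from the release you want to inspect. If “male” and “other” never show up in the resulting counts, that matches the problem described above.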

phirework · December 17, 2020, 6:39pm

Good catch, thanks! I’m working on another export now.

phirework · December 22, 2020, 11:59pm

Okay, I’ve added a dot release, 6.1, with correct gender demographics, and it’s currently available on the datasets page. Thanks for the report!
