2020 End-of-Year Common Voice Dataset Release

Happy end of 2020!

While it has been a tumultuous year for all, the Common Voice team is excited to announce the end-of-year dataset release! First, we could not have made it through this year without the dedicated and passionate Common Voice community. Thank you so much to all of you for your amazing work, voice donations, clip validations, code contributions and community support.

The past six months have seen continued growth of the Common Voice dataset, with an additional 2,000 hours added, 6 more languages (Hindi, Lithuanian, Luganda, Thai, Finnish, Hungarian), and over 7 million clips total! The top languages by hours are currently:

  • English: 2,179 hours
  • Kinyarwanda: 1,510 hours
  • German: 836 hours
  • Catalan: 745 hours
  • French: 682 hours
  • Kabyle: 622 hours
  • Spanish: 579 hours
  • Persian: 320 hours

As we mentioned in a previous Common Voice update, the platform has been in maintenance mode for the past four months (except for you, our amazing community, who have been anything but sitting in maintenance mode!). Even so, we have still committed to regular releases. In particular, we’d like to highlight:

  • Past datasets are now available for download from the datasets page, along with a repo for their corresponding datasheets
  • Demo mode work from our Google Summer of Code project (stay tuned for more information on this soon!)
  • Better blank clip handling on both the front end and back end, which should drastically improve the validation experience
  • Better handling of historical demographics data
  • Ongoing UI improvements, performance optimizations, and bugfixes
  • Ongoing sentences and localization updates

Finally, we want to reassure everyone that Common Voice has a strong and bright future ahead. The team has been hard at work behind the scenes to ensure a stable runway for the platform, data and community; we are ramping up to make announcements early next year, so please stay tuned!

In the meantime, we wish you all a relaxing end of the year, and please keep on contributing!

Great!

Just a remark:

Looks like there is an issue with the “gender” info. Only “female” appears in the .tsv files. Where are “male” and “other”? Could you please fix it? Thanks

Edit:

For the Catalan language datasets, if I’ve parsed them properly:

  • In the CV 5.1 dataset (June 2020) we have 167,281 recordings with “female” gender metadata, and we also have recordings with “male” and “other” gender metadata.
  • In the CV 6.0 dataset (the latest one, Dec 2020) we have 53,010 recordings with “female” gender metadata only. The “male” and “other” gender metadata are missing.
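
For anyone who wants to reproduce this kind of check, here is a minimal sketch in Python of how the gender labels in one of the .tsv files could be counted. It assumes the file is tab-separated with a header row containing a `gender` column; the function name `count_gender_labels` and the sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter


def count_gender_labels(tsv_file):
    """Count the values found in the `gender` column of a tab-separated file.

    `tsv_file` is any open text file (or file-like object) with a header row.
    Rows with an empty or missing gender are counted as "(unlabeled)".
    """
    # QUOTE_NONE keeps quote characters inside sentences from being
    # interpreted as CSV quoting.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        label = (row.get("gender") or "").strip()
        counts[label or "(unlabeled)"] += 1
    return counts


# Tiny made-up sample standing in for a real .tsv file from the dataset.
sample = io.StringIO(
    "client_id\tpath\tsentence\tgender\n"
    "a1\tclip_1.mp3\tBon dia\tfemale\n"
    "b2\tclip_2.mp3\tBona nit\tmale\n"
    "c3\tclip_3.mp3\tAdéu\t\n"
)
print(count_gender_labels(sample))
```

To run it against a downloaded file, something like `with open("validated.tsv", encoding="utf-8", newline="") as f: print(count_gender_labels(f))` should work, assuming that is the file you want to inspect. If only “female” and “(unlabeled)” show up in the result, you are probably seeing the same issue.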

Good catch, thanks! I’m working on another export now.

Okay, I’ve added a dot release, 6.1, with the correct gender demographics, and it’s now available on the datasets page. Thanks for the report!
