Dataset 13 release 🎉

It’s that time again, Common Voice community! We’re bringing you a brand new version of the dataset with the freshest data. Let’s look at what’s in it:

Dataset cutoff date

This dataset includes contributions made up to 2023-03-09. Contributions made after March 9th, 2023 are not included and will be made available in our next data release.

If you want to skip reading all the details and jump right to the data, all our datasets are available here.

What’s in 13.0

New Languages

We’re so excited to have Turkmen (tk), Lao (lo), Dioula (dyu) and Icelandic (is) joining Common Voice in this dataset. Every new language added helps make sure that new technology projects and programs can meet people in the languages they are most comfortable with. Thanks so much to the community members who suggested these languages. If you’re looking for a language we don’t currently support, get in touch!

More data!

13.0 brings you an additional 1020 hours (from 26119 hours in 12.0 to 27139 hours in 13.0). We also have a big jump in our validated data: this release contains an additional 542 hours of validated data (from 17145 hours in 12.0 to 17687 hours in 13.0).
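To put those numbers in context, here is a minimal Python sketch that recomputes the added hours and expresses them as relative growth. The hour totals are the ones quoted above; the function name `release_delta` is only an illustrative helper, not part of any Common Voice tooling.

```python
def release_delta(before_hours: int, after_hours: int) -> tuple[int, float]:
    """Return (hours added, growth in percent) between two releases."""
    added = after_hours - before_hours
    growth_pct = 100.0 * added / before_hours
    return added, growth_pct


# Totals quoted in the announcement (12.0 -> 13.0).
total_added, total_growth = release_delta(26119, 27139)
validated_added, validated_growth = release_delta(17145, 17687)

print(f"Total hours added:     {total_added} ({total_growth:.1f}% growth)")
print(f"Validated hours added: {validated_added} ({validated_growth:.1f}% growth)")
```

The differences match the announcement (1020 and 542 hours), and the percentages show how large each increase is relative to 12.0.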

Accent bug is fixed in this release

Some helpful community members pointed out that some accent data was missing in v12. This has been fixed in this release. Enjoy your accent data, and apologies again for the bug in the last version.

What’s next

We’ll keep releasing more datasets, alongside some exciting product updates that streamline the sentence collector experience. As always, thank you so much for your contributions, bug reports and suggestions, and for pointing us at new languages. I know you’re going to continue to build amazing things with this new data, and I can’t wait to see what you come up with next!

Dataset visualisations have been updated based on the cv-dataset repository and are available at:

Updated the Common Voice Metadata Viewer with new data…

Common Voice Dataset Analyzer needs a couple more days…

Opened the Common Voice Metadata Viewer and Common Voice Dataset Analyzer on a new server, under a new domain - and both have been updated (also fixed some simple bugs)…

As this release did not include the default splits, I had to create them. I’ll provide them as a convenience download, which needs a couple of days.
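For anyone who wants to build their own splits in the meantime, here is a minimal Python sketch of one common approach: assigning each speaker to exactly one of train/dev/test, so no voice appears in more than one split. This is an assumption-laden illustration, not the method used for the splits mentioned above and not the official Common Voice splitting algorithm. It assumes each row is a dict with a `client_id` key (as in the `client_id` column of `validated.tsv`); the function names and the 80/10/10 proportions are hypothetical.

```python
import hashlib


def assign_split(client_id: str, dev_pct: int = 10, test_pct: int = 10) -> str:
    """Map a speaker ID to 'train', 'dev' or 'test' deterministically."""
    # Hash the speaker ID so the assignment is stable across runs and machines.
    digest = hashlib.sha256(client_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # pseudo-uniform bucket in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


def split_rows(rows):
    """Group rows (dicts with a 'client_id' key) into speaker-disjoint splits."""
    splits = {"train": [], "dev": [], "test": []}
    for row in rows:
        splits[assign_split(row["client_id"])].append(row)
    return splits


# Usage with made-up rows; real rows could come from csv.DictReader
# over validated.tsv with delimiter="\t".
rows = [{"client_id": f"speaker{i % 50}", "path": f"clip_{i}.mp3"} for i in range(200)]
splits = split_rows(rows)
print({name: len(items) for name, items in splits.items()})
```

Because the split depends only on the speaker ID, re-running the script on a later release keeps existing speakers in the same split. The proportions are approximate and apply to speakers, not clips, so the clip counts per split can be uneven.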
