Dataset 13 release 🎉

jesslynnrose · March 16, 2023, 10:57am

It’s time again, Common Voice community! We’re bringing you a brand new version of the dataset with the freshest data! Let’s look at what’s in here:

Dataset cutoff time/date:

This dataset includes contributions made up until 2023-03-09. Contributions made after March 9th are not included in this dataset and will appear in our next data release instead.

Cutoff date for samples

This dataset contains submissions up to 2023-03-09. Data collected after March 9th, 2023 will be made available in a future version of the dataset.

If you want to skip reading all the details and jump right to the data, all our datasets are available here.

What’s in 13.0

New Languages

We’re so excited to have Turkmen (tk), Lao (lo), Dioula (dyu) and Icelandic (is) joining Common Voice in this dataset. Every new language added helps make sure that new technology projects and programs can meet people in the languages they are most comfortable with. Thanks so much to the community members who suggested these languages. If you’re looking for a language we don’t currently support, get in touch!

More data!

13.0 brings you an additional 1020 hours (from 26119 hours in 12.0 to 27139 hours now). We also have a big jump in our validated data: this release contains an additional 542 hours of validated data (from 17145 hours in 12.0 to 17687 hours now).
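
For anyone who wants to double-check the figures above, here is a minimal sketch in Python. It only uses the hour counts quoted in this post; the variable and function names are made up for illustration and are not part of any Common Voice tooling.

```python
# Hour counts quoted in the 13.0 announcement (no other data is assumed).
TOTAL_V12, TOTAL_V13 = 26119, 27139
VALID_V12, VALID_V13 = 17145, 17687


def growth(old, new):
    """Return (absolute increase, percentage increase) from old to new."""
    delta = new - old
    return delta, 100 * delta / old


total_delta, total_pct = growth(TOTAL_V12, TOTAL_V13)
valid_delta, valid_pct = growth(VALID_V12, VALID_V13)

# Share of all recorded hours in 13.0 that are validated.
validated_share = 100 * VALID_V13 / TOTAL_V13

print(f"total hours:     +{total_delta} h ({total_pct:.1f}%)")
print(f"validated hours: +{valid_delta} h ({valid_pct:.1f}%)")
print(f"validated share of 13.0: {validated_share:.1f}%")
```

The differences come out to the 1020 and 542 hours stated above, and the last line shows roughly what fraction of the recorded hours has been validated so far.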

Accent bug is fixed in this release

Some helpful community members pointed out that we had some missing accent data in v12. This has been fixed for this data release. Enjoy your accent data, and apologies again for the bug in the last version.

What’s next

We’ll keep releasing more datasets, alongside some exciting product updates streamlining the sentence collector experience. As always, thank you so much for your contributions, bug reports and suggestions, and for pointing us to new languages. I know you’re going to continue to build amazing things with this new data, and I can’t wait to see what you come up with next!

kathyreid · March 16, 2023, 1:05pm

Dataset visualisations have been updated based on the cv-dataset repository and are available at:

bozden · March 16, 2023, 10:30pm

Updated the Common Voice Metadata Viewer with new data…

Common Voice Dataset Analyzer needs a couple more days…

bozden · March 20, 2023, 1:37pm

Opened the Common Voice Metadata Viewer and Common Voice Dataset Analyzer on a new server, under a new domain - and both have been updated (also fixed some simple bugs)…

As this release did not include the default splits, I had to create them. I’ll make them available for download as a convenience, but that needs a couple of days.
