It’s that time again, Common Voice community! We’re bringing you a brand new version of the dataset with the freshest data. Let’s look at what’s in here:
Cutoff date for samples
This dataset includes contributions made up to 2023-03-09. Contributions made after March 9th, 2023 will be included in a future release of the dataset.
If you want to skip reading all the details and jump right to the data, all our datasets are available here.
What’s in 13.0
New Languages
We’re so excited to have Turkmen (tk), Lao (lo), Dioula (dyu) and Icelandic (is) joining Common Voice in this dataset. Every new language helps make sure that new technology projects and programs can meet people in the languages they are most comfortable with. Thanks so much to the community members who suggested these languages. If you’re looking for a language we don’t currently support, get in touch!
More data!
13.0 brings you an additional 1020 hours of data (from 26119 hours in 12.0 to 27139 hours now). We also have a big jump in our validated data: this release contains an additional 542 validated hours (from 17145 hours in 12.0 to 17687 hours now).
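If you’d like to check these figures yourself, here is a minimal Python sketch that recomputes the growth from the hour totals quoted above. The names (TOTAL_HOURS, VALIDATED_HOURS, growth) are made up for illustration and are not part of any Common Voice tooling.

```python
# Hour totals as quoted in this post (12.0 vs 13.0).
# All names below are hypothetical, chosen only for this example.
TOTAL_HOURS = {"12.0": 26119, "13.0": 27139}
VALIDATED_HOURS = {"12.0": 17145, "13.0": 17687}


def growth(hours, old="12.0", new="13.0"):
    """Return (absolute increase in hours, percentage increase)."""
    delta = hours[new] - hours[old]
    return delta, 100 * delta / hours[old]


total_delta, total_pct = growth(TOTAL_HOURS)
validated_delta, validated_pct = growth(VALIDATED_HOURS)

# Share of the 13.0 release that has been validated.
validated_share = 100 * VALIDATED_HOURS["13.0"] / TOTAL_HOURS["13.0"]

print(f"Total hours:     +{total_delta} ({total_pct:.1f}% more than 12.0)")
print(f"Validated hours: +{validated_delta} ({validated_pct:.1f}% more than 12.0)")
print(f"Validated share of 13.0: {validated_share:.1f}%")
```

The percentages are derived only from the totals in this post, so treat them as a rough summary rather than official statistics.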
Accent bug is fixed in this release
Some helpful community members pointed out that some accent data was missing in 12.0. This has been fixed for this release. Enjoy your accent data, and apologies again for the bug in the last version.
What’s next
We’ll keep releasing more datasets, alongside some exciting product updates that streamline the sentence collector experience. As always, thank you so much for your contributions, bug reports and suggestions, and for pointing us at new languages. I know you’re going to continue to build amazing things with this new data, and I can’t wait to see what you come up with next!