Dataset release: MCV 14

jesslynnrose · June 30, 2023, 11:13am

Good morning (or evening, or afternoon) to the delightful Common Voice community of contributors, dataset users and folks hanging out with us to learn more about language and technology.

It’s one of my favorite times of the year: time for another dataset release!

Live and ready to download at: https://commonvoice.mozilla.org/en/datasets

Mozilla Common Voice 14 is live and we’re so excited that we have 28117 hours of speech data, of which 18651 hours are validated.

I love to see new languages available, so join me in welcoming Pashto, Albanian, Amharic and Standard Moroccan Amazigh to the platform and dataset.

We now have a total of 112 languages live and we would be so excited to welcome more. Please shout if you have any questions, suggestions or want to celebrate along!

So many thanks to the sentence, voice and technical contributors who made this possible.

kathyreid · July 1, 2023, 10:44pm

And the metadata coverage graphs are available at:

gina · July 4, 2023, 6:52am

We thank the MCV community for their continued efforts and time.

yodakohl · July 7, 2023, 2:11pm

I have a question about the dataset delta releases. The English delta 14 should contain 71 recorded and 56 validated hours. However, validated.tsv only contains 4035 entries. That is only about 9% of the 44035 clips. It feels like I’m missing a lot of data if I only add validated clips with each delta release.
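The size of the mismatch can be estimated with a little arithmetic. This is only a rough sketch: it assumes clips have about the same average length in every bucket, which the release does not guarantee, and the function name is made up for illustration.

```python
def expected_validated_clips(total_clips, recorded_hours, validated_hours):
    """Estimate how many clips the stated validated hours would imply.

    Assumption (not from the release notes): clips have roughly the same
    average duration in every bucket.
    """
    avg_clip_seconds = recorded_hours * 3600 / total_clips
    return validated_hours * 3600 / avg_clip_seconds


# Figures quoted in this thread for the English delta 14.
total_clips = 44035
validated_rows = 4035

estimate = expected_validated_clips(total_clips, recorded_hours=71, validated_hours=56)
share = validated_rows / total_clips

print(f"clips implied by 56 validated hours: about {estimate:.0f}")
print(f"share of clips actually in validated.tsv: {share:.1%}")
```

Under that assumption, 56 validated hours would correspond to roughly 34,700 clips, far more than the 4035 rows in validated.tsv.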

bozden · July 7, 2023, 2:30pm

@yodakohl, correct… There is a discrepancy…

This is also the same in the metadata (~56 hours):

I checked the en delta, and there are 44035 .mp3 files, and this matches the sums of records in validated/invalidated/other buckets. Many of the recordings are in other, i.e. recorded but not validated yet…
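A check along these lines can be scripted. The sketch below assumes the extracted delta has `validated.tsv`, `invalidated.tsv` and `other.tsv` next to a `clips/` folder of .mp3 files, each TSV with one header row; adjust the names if your copy is laid out differently. The helper names are hypothetical.

```python
import csv
from pathlib import Path

# Assumed bucket file names; only validated.tsv is named explicitly in this thread.
BUCKETS = ("validated", "invalidated", "other")


def count_rows(tsv_path):
    """Count data rows in a TSV file, skipping the header line."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE so quote characters inside sentences are treated as plain text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip header
        return sum(1 for row in reader if row)


def bucket_report(dataset_dir):
    """Return row counts per bucket, their total, and the number of .mp3 clips."""
    root = Path(dataset_dir)
    counts = {b: count_rows(root / f"{b}.tsv") for b in BUCKETS}
    counts["bucket_total"] = sum(counts[b] for b in BUCKETS)
    counts["mp3_files"] = sum(1 for _ in (root / "clips").glob("*.mp3"))
    return counts
```

Calling `bucket_report("path/to/extracted/delta")` and comparing `bucket_total` with `mp3_files` shows whether every clip is accounted for in one of the three buckets.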

If the “validated hours” figure on the download page and metadata is not correct, that means you will not lose any data if you use the validated bucket. If the buckets are not created correctly, then we all are in trouble.

You may like to open an issue on GitHub…

jesslynnrose · July 11, 2023, 1:29pm

Thank you so much for flagging this. I’m investigating it with the engineering team now and we should know more shortly.
