Dataset release: MCV 14

Good morning (or evening, or afternoon) to the delightful Common Voice community of contributors, dataset users and folks hanging out with us to learn more about language and technology.

It’s one of my favorite times of the year: time for another dataset release!

Live and ready to download at:

Mozilla Common Voice 14 is live and we’re so excited that we have 28117 hours of speech data, of which 18651 hours are validated.

I love to see new languages available, so join me in welcoming Pashto, Albanian, Amharic and Standard Moroccan Amazigh to the platform and dataset.

We now have a total of 112 languages live and we would be so excited to welcome more. Please shout if you have any questions or suggestions, or want to celebrate along!

So many thanks to the sentence, voice and technical contributors who made this possible.


And the metadata coverage graphs are available at:

:tada: :tada: :tada:

:tada: :tada: We thank the MCV community for continued efforts and time :tada: :tada:


I have a question about the dataset delta releases. The English delta 14 should contain 71 recorded and 56 validated hours. However, validated.tsv only contains 4035 entries. That is only about 10% of the 44035 clips. It feels like I’m missing a lot of data if I only add the validated clips with each delta release.
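To make the mismatch concrete, here is a rough sanity check using only the figures quoted above. The helper names are made up for illustration. It works out what share of the clips is in validated.tsv, and how long the average clip would have to be for 4035 clips to add up to 56 hours.

```python
# Figures quoted in this thread for the English delta 14.
validated_hours = 56    # "validated hours" reported for the delta
validated_rows = 4035   # entries in validated.tsv
total_clips = 44035     # clips shipped in the delta


def validated_share(rows: int, clips: int) -> float:
    """Fraction of all clips that appear in validated.tsv."""
    return rows / clips


def implied_seconds_per_clip(hours: float, rows: int) -> float:
    """Average clip length needed for `rows` clips to total `hours`."""
    return hours * 3600 / rows


share = validated_share(validated_rows, total_clips)
seconds = implied_seconds_per_clip(validated_hours, validated_rows)
print(f"validated share: {share:.1%}")
print(f"implied average clip length: {seconds:.1f} s")
```

The share comes out at roughly 9%, and the implied average is close to 50 seconds per clip, which seems far too long for single-sentence recordings. That suggests the hours figure and the row count cannot both be right.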

@yodakohl, correct… There is a discrepancy…

This is also the same in the metadata (~56 hours):

I checked the en delta, and there are 44035 .mp3 files, and this matches the sums of records in validated/invalidated/other buckets. Many of the recordings are in other, i.e. recorded but not validated yet…
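For anyone who wants to repeat this check, below is a minimal sketch. It assumes the extracted delta has a `clips/` folder of .mp3 files next to `validated.tsv`, `invalidated.tsv` and `other.tsv`, each with a single header row. The function names and the command-line usage are my own, not part of any Common Voice tooling.

```python
import csv
import sys
from pathlib import Path

# Bucket files assumed to sit next to the clips/ folder.
BUCKETS = ("validated", "invalidated", "other")


def count_rows(tsv_path: Path) -> int:
    """Count data rows in a TSV file, excluding the header line."""
    with tsv_path.open(newline="", encoding="utf-8") as f:
        # QUOTE_NONE so quotation marks inside sentences are not parsed.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip the header
        return sum(1 for row in reader if row)


def bucket_report(dataset_dir: Path) -> dict:
    """Return per-bucket row counts, their sum, and the number of .mp3 files."""
    counts = {b: count_rows(dataset_dir / f"{b}.tsv") for b in BUCKETS}
    counts["buckets_total"] = sum(counts[b] for b in BUCKETS)
    counts["mp3_files"] = sum(1 for _ in (dataset_dir / "clips").glob("*.mp3"))
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path to the extracted language folder as the first argument.
    print(bucket_report(Path(sys.argv[1])))
```

If `buckets_total` equals `mp3_files`, every clip is accounted for in one of the three buckets, and the open question is only how the validated hours are computed.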

If only the “validated hours” figure on the download page and in the metadata is incorrect, you will not lose any data by using the validated bucket. If the buckets themselves are not created correctly, then we are all in trouble.

You may like to open an issue on GitHub…

Thank you so much for flagging this. I’m investigating it with the engineering team now and we should know more shortly.