Common Voice 19.0 Dataset Release

The Common Voice team is delighted to present the 19.0 dataset release. This release adds 463 hours of clips, taking the dataset to a total of 32,584 hours of open speech data that’s free to use. It also brings a notable increase in validated data: 650 hours of newly validated clips take the total duration of validated clips in Common Voice 19.0 to 21,593 hours.

Two new languages joined the dataset with this release! We’re delighted to welcome Sindhi and Tsonga to the Common Voice dataset for the first time. This brings the total number of languages in the Common Voice dataset to 131. This might sound impressive, but with over 7000 languages spoken in the world today, we’re just getting started. If you would like to see your language on Common Voice, please get in touch and let us know about it.

You can download Common Voice 19.0 at our Dataset Download page.

You may notice that a handful of languages that are recent additions to the platform are not being released. They will be part of a special release around May 2025 as part of the launch for a new platform and new data format. This was pre-agreed with the community researchers who are working on these datasets, and we’re excited to announce more soon!

As always, our thanks to the countless voice and text contributors, validators and community members who make up the dataset and the heart of the Common Voice efforts. None of this would be possible without you and we’re so excited to continue to grow with the community.


I can see Sindhi (sd) and Xitsonga (ts) are the new additions, but they are not released yet. It seems they both have 0h validated for now. They come from the datasets API, so I need to manually exclude them in my code.
Total downloadable dataset count is still 129.
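
For anyone needing the same workaround, here is a minimal sketch of that kind of exclusion. The record shape (`locale`, `validated_hours`) and the function name are made up for illustration and are not the actual datasets API fields, so adapt them to whatever your client returns.

```python
from typing import Iterable, Mapping


def downloadable_locales(
    records: Iterable[Mapping[str, object]],
    manual_exclusions: frozenset = frozenset(),
) -> list:
    """Return locale codes that look downloadable.

    Assumption: each record is a mapping with a "locale" code and a
    "validated_hours" number. These keys are hypothetical.
    """
    keep = []
    for record in records:
        locale = record["locale"]
        hours = record.get("validated_hours", 0) or 0
        # Skip locales listed by hand and locales with no validated audio.
        if locale in manual_exclusions or hours <= 0:
            continue
        keep.append(locale)
    return keep


# Illustrative values only, not real API output.
sample = [
    {"locale": "ca", "validated_hours": 100.0},
    {"locale": "sd", "validated_hours": 0},
    {"locale": "ts", "validated_hours": 0},
]
print(downloadable_locales(sample, manual_exclusions=frozenset({"sd", "ts"})))
```

Keeping both checks means a locale drops out either because it is listed by hand or because it has no validated audio yet, so the list does not need touching when a language is eventually released.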

And here is the tasty data visualisation for the v19 release :tada:

I’ve updated the visualisation to have the names of the languages rather than just their language codes to make it more easily readable.

My observations:

:arrow_forward: Catalan (ca) continues to be the leader in terms of data - speaking volumes about the efforts to revitalise culture and language in Catalunya. It’s also one of the few languages that has data for all age groups, particularly older speakers - this sort of data is missing for most other languages.

:arrow_forward: Kiswahili (sw) is one of the languages where there is more data for female-identifying speakers than for male-identifying speakers :female_sign: - although Japanese (ja), Western Mari (mrj) and Luganda (lg) do pretty well here, too!

:arrow_forward: Sentence domains can now be categorised, and although most new sentences are “general”, Albanian (sq) has a lot of sentences related to law and government.

:arrow_forward: Tsonga (ts), a Bantu language spoken in Southern Africa, has dethroned Icelandic (is) as the language with the highest average utterance duration. I don’t know enough about Tsonga to speculate why - it’s a somewhat agglutinative language, but many Tsonga words are generally short.

:arrow_forward: Bengali / Bangla (bn) has a significant amount of data that is not yet validated, and therefore does not appear in training / dev / test splits. There is a similar case for many languages new to Common Voice - it takes time to validate.

:arrow_forward: The language with the highest average number of contributions per speaker is Taita (dav), a Bantu language from Kenya.


Kudos to Catalans! Admirable success!

When I examined the data with the new delta page, I could see how it happened over time: between v8.0 and v9.0 they had a huge campaign and recorded 320h/mo with ~6000 new users/mo. Since then they have kept adding about a medium-sized CV dataset's worth of data, but the main emphasis has shifted to validating the ~1000h of campaign recordings.

Recordings that are not yet validated are still a problem across CV globally. One third of the recordings, about 11k hours' worth, are just waiting there…
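
As a quick sanity check on that fraction, here is a small sketch using the two totals quoted in the announcement (32,584 hours overall, 21,593 hours validated). It assumes everything outside the validated total counts as "waiting", which may also include clips that were reviewed and rejected.

```python
TOTAL_HOURS = 32_584       # total recorded hours, from the announcement
VALIDATED_HOURS = 21_593   # validated hours, from the announcement


def unvalidated_share(total: float, validated: float) -> tuple:
    """Return (hours outside the validated set, their share of the total)."""
    remaining = total - validated
    return remaining, remaining / total


hours, share = unvalidated_share(TOTAL_HOURS, VALIDATED_HOURS)
print(f"{hours:,} h outside the validated set ({share:.1%} of all recordings)")
```

That works out to 10,991 hours, or roughly a third of the dataset.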
