Common Voice 19.0 Dataset Release

The Common Voice team is delighted to present the 19.0 dataset release. This release added 463 hours of clips, taking the dataset to a total of 32,584 hours of open speech data that’s free to use. It also saw a notable increase in validated hours: 650 hours of newly validated clips take the total duration of validated clips in Common Voice 19.0 to 21,593 hours.

Two new languages joined the dataset with this release! We’re delighted to welcome Sindhi and Tsonga to the Common Voice dataset for the first time. This brings the total number of languages in the Common Voice dataset to 131. This might sound impressive, but with over 7,000 languages spoken in the world today, we’re just getting started. If you would like to see your language on Common Voice, please get in touch and let us know about it.

You can download Common Voice 19.0 at our Dataset Download page.

You may notice that a handful of languages that were recently added to the platform are not included in this release. They will be released separately around May 2025, as part of the launch of a new platform and new data format. This was agreed in advance with the community researchers who are working on these datasets, and we’re excited to announce more soon!

As always, our thanks go to the countless voice and text contributors, validators and community members who make up the dataset and the heart of the Common Voice efforts. None of this would be possible without you, and we’re so excited to continue to grow with the community.
