Hello everyone! The latest versions of Common Voice Scripted Speech (v26.0) and Spontaneous Speech (v4.0) are available for download at Mozilla Data Collective: https://mozilladatacollective.com/organization/cmfh0j9o10006ns07jq45h7xk. In this release, 10 new datasets and 7 new languages have been added!
Highlights:
Scripted Speech
- An additional 598 hours (~593K clips) of audio in the datasets
- Welcoming 4 more datasets: Abaza (abq), Khakas (kjh), Khmer (km), and Afaan Oromo (om)
- The fastest-growing datasets this quarter are Pashto (ps), Georgian (ka), Afaan Oromo (om), Laz (lzz), Sindhi (sd), Abaza (abq), Ormuri (oru), and Kabardian (kbd)
Spontaneous Speech
- An additional 28.7 hours (5,894 clips) of audio in the datasets
- Welcoming 6 more datasets: Amharic (am), Lango (laj), Mon (mnw), Afaan Oromo (om), Shimaore (swb), and Tatar (tt)
- The fastest-growing datasets this quarter are Georgian (ka), Javanese (jv), Pashto (ps), Alsatian (gsw), and Tashlhiyt (shi)
Many datasets have recordings waiting for validation. About 25% of audio contributions in Scripted Speech and about 40% of audio contributions in Spontaneous Speech need to be validated. You can help review at https://commonvoice.mozilla.org!
Thank you everyone for your hard work and contributions ahead of this release!
For communication or any questions, please reach out to us on Discourse, Matrix, or Discord!