Release live: MCV Scripted Speech v26.0 and Spontaneous Speech v4.0

Hello everyone! The latest versions of Common Voice Scripted Speech (v26.0) and Spontaneous Speech (v4.0) are available for download at Mozilla Data Collective: https://mozilladatacollective.com/organization/cmfh0j9o10006ns07jq45h7xk. In this release, 10 new datasets and 7 new languages have been added!

Highlights:

Scripted Speech

  • An additional 598 hours (~593K clips) of audio in the datasets

  • Welcoming 4 more datasets: Abaza (abq), Khakas (kjh), Khmer (km), and Afaan Oromo (om)

  • The fastest-growing datasets this quarter are Pashto (ps), Georgian (ka), Afaan Oromo (om), Laz (lzz), Sindhi (sd), Abaza (abq), Ormuri (oru), and Kabardian (kbd)

Spontaneous Speech

  • An additional 28.7 hours (5,894 clips) of audio in the datasets

  • Welcoming 6 more datasets: Amharic (am), Lango (laj), Mon (mnw), Afaan Oromo (om), Shimaore (swb), and Tatar (tt)

  • The fastest-growing datasets this quarter are Georgian (ka), Javanese (jv), Pashto (ps), Alsatian (gsw), and Tashlhiyt (shi)

Many datasets have recordings waiting for validation. About 25% of audio contributions in Scripted Speech and about 40% of audio contributions in Spontaneous Speech need to be validated. You can help review at https://commonvoice.mozilla.org!

Thank you, everyone, for your hard work and contributions ahead of this release 🙂

If you have any questions, please reach out to us on Discourse, Matrix, or Discord!
