Hi all! We’re excited to announce that the new dataset releases for Common Voice Scripted Speech (24.0) and Spontaneous Speech (2.0) are now available for download on the Mozilla Data Collective website: https://datacollective.mozillafoundation.org/datasets?q=common+voice
Highlights:
Scripted Speech:
- Previous version: MCV Scripted Speech 23.0
  - Release date: 17th September 2025
  - Total hours: 35,921
  - Total validated hours: 24,600
  - Number of languages: 286
- New version: MCV Scripted Speech 24.0
  - Release date: 17th December 2025
  - Total hours: 38,932
  - Total validated hours: 25,886
  - Number of languages: 289
Spontaneous Speech:
- Previous version: Spontaneous Speech 1.0
  - Release date: 17th September 2025
  - Number of languages: 58
- New version: Spontaneous Speech 2.0
  - Release date: 17th December 2025
  - Number of datasets: 62
Over 3,000 more hours and three new language communities. Welcome to Lower Sorbian (dsb), Alsatian (gsw), and Laz (lzz)!
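For anyone who wants to check the headline figures, here is a quick sketch that computes the difference between the two Scripted Speech releases from the numbers listed above. The dictionary keys are only illustrative names, not fields from the datasets.

```python
# Figures copied from the Scripted Speech highlights above.
# The key names are illustrative only, not dataset fields.
previous = {"total_hours": 35_921, "validated_hours": 24_600, "languages": 286}
current = {"total_hours": 38_932, "validated_hours": 25_886, "languages": 289}

# Growth between MCV Scripted Speech 23.0 and 24.0.
deltas = {key: current[key] - previous[key] for key in previous}

for key, delta in deltas.items():
    print(f"{key}: +{delta:,}")
```

This gives 3,011 more total hours, 1,286 more validated hours, and 3 more languages.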
Congratulations on your campaigns!
Pashto, Alsatian, Irish, Galician, Kabardian, Adyghe, French, Igbo, and Kurmanji Kurdish!
Datasets for Dholuo:
Two datasets for Dholuo (luo) are released:
- The regular MCV dataset for Dholuo, released under the CC-0 license;
- The DhoNam: Dholuo Speech dataset is a speech corpus designed to supercharge Automatic Speech Recognition (ASR) and other speech technologies for Dholuo, one of Kenya’s major indigenous languages. This dataset is being released under the Nwulite Obodo Open Data License (NOODL) and has been collected under the supervision of Dr. Lilian Wanzare and with funding provided by GIZ FAIR Forward.
Notes on Spontaneous Speech:
- The English dataset will again not be released this time because of quality issues; we’re planning to include it in the next release.
- Demographics data will not be provided for the datasets in this release; again, we’re planning to include them in the next release.
- Quality tags will again be given for each record in the datasets; you can read more about them here: https://discourse.mozilla.org/t/community-feedback-request-introducing-quality-tags-for-common-voice-spontaneous-speech-we-want-to-hear-from-you/146127
  Here’s the list of the new tags:
  - non-allowed-script: tag for transcriptions containing a writing system not associated with the language, e.g. Latin letters in Kabardian;
  - mixed-script-words: tag for transcriptions containing multiple writing systems at the word/token level, e.g. a mix of Latin and Arabic letters in one word in Arabic;
  - mixed-script-transcription: tag for transcriptions containing multiple writing systems where each word/token consistently uses a single script, e.g. when a word written in Latin script follows a word written in Arabic script in Kurmanji Kurdish.
- The disfluency tags have been standardised, so it will be easier to filter out disfluencies in your code.
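To make the three script-related tags more concrete, here is a rough Python sketch of how such checks could be approximated on your own side. It is not the Common Voice tagging pipeline. The function names `script_of` and `script_tags` are hypothetical, the script of a letter is guessed from the first word of its Unicode character name, and letting `non-allowed-script` co-occur with the other tags is an assumption of this sketch.

```python
import unicodedata


def script_of(ch):
    """Rough script label for a letter: first word of its Unicode name.

    Assumption of this sketch: e.g. 'LATIN', 'CYRILLIC', 'ARABIC'.
    Non-letters (digits, punctuation, combining marks) return None.
    """
    if not ch.isalpha():
        return None
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # character has no Unicode name
        return None


def script_tags(transcription, allowed_scripts):
    """Return a set of script-related tags for one transcription (hypothetical helper)."""
    word_scripts = []
    for word in transcription.split():
        scripts = {s for s in map(script_of, word) if s is not None}
        if scripts:
            word_scripts.append(scripts)

    all_scripts = set().union(*word_scripts)
    tags = set()
    # A writing system not associated with the language is present.
    if all_scripts - set(allowed_scripts):
        tags.add("non-allowed-script")
    # Several scripts inside a single word/token.
    if any(len(scripts) > 1 for scripts in word_scripts):
        tags.add("mixed-script-words")
    # Several scripts overall, but each word uses only one.
    elif len(all_scripts) > 1:
        tags.add("mixed-script-transcription")
    return tags


# Arabic letters are written as escapes to avoid bidirectional-text display issues.
arabic_word = "\u0643\u062a\u0627\u0628"
print(sorted(script_tags("ez " + arabic_word, {"LATIN", "ARABIC"})))
print(sorted(script_tags("\u0643\u062ab\u0627\u0628", {"ARABIC"})))
```

Unicode character names are only a coarse proxy for scripts, so treat this as a way to explore the tags rather than to reproduce them.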
Thank you!
We’re all so proud to work supporting such dedicated language communities. Thank you all so much.
Get the newest release at:
https://datacollective.mozillafoundation.org/datasets?q=common+voice