Hello Common Voice Community!
We are excited to announce the second dataset release in 2022 - Common Voice 9!
Your incredible contributions and community activities have made this latest version of the Common Voice Dataset possible. You can download the Common Voice dataset for free. The dataset now contains more than 20,000 hours of speech!
It has doubled this year and now includes 94 languages, making it the most diverse multilingual speech corpus in the world.
This release wouldn’t be possible without YOU, from donating your voice, to initiating your language in our project, to opening new opportunities for people to build voice technology tools that can support every language spoken across the world.
Access the dataset: https://commonvoice.mozilla.org/datasets
Access the metadata: https://github.com/common-voice/cv-dataset
Dataset Highlights
- Twenty-seven languages now have at least 100 hours of speech data. They include Bengali, Thai, Basque, and Frisian.
- Nine languages now have at least 45% of their gender tags as female. They include Marathi, Dhivehi, and Luganda.
- We’re excited to welcome the languages of Tigre, Taiwanese (Minnan), Meadow Mari, Bengali, Toki Pona and Cantonese to the dataset.
- We would also like to congratulate the Igbo, Catalan, Urdu, Norwegian Nynorsk and Marathi communities for their amazing dataset growth.
Community Spotlight: Bengali
“In densely populated South Asia, the internet is more scalable than language education. Bengali.AI joined the common voice initiative to enable tech inclusivity for Bengali speakers via speech AI. Through social media campaigns starting from the International Mother Language Day of 2022, we have been able to mobilize up to 20K people into contributing 300+ hours of speech at the common voice platform.”
Written by Imitaz, Bengali Language Contributor
What could I do next?
Care about tech being more inclusive? Share via your social media:
I am a #CommonVoice contributor. We are making voice technology better for languages spoken across the world. Join us by visiting commonvoice.mozilla.org/
Already using the Common Voice dataset?
Let us know what you’re building via social media using the #CommonVoice hashtag, or on the Community Discourse.
On behalf of the Common Voice Team, thank you!
Hillary Juma,
Common Voice Community Manager, Mozilla Foundation