Common Voice 21 dataset now available

We’re delighted to announce that the Common Voice 21 dataset is now available for download :tada:

Common Voice now hosts 134 languages, with nearly 33,500 hours of speech from over 350,000 distinct speakers.

In this release, we’re pleased to welcome Norwegian Bokmål, one of Norway’s two official languages, the other being Nynorsk. Nynorsk and Bokmål have different heritages, as many closely related languages do! Bokmål (literally “book language”) is heavily influenced by Danish, dating from the period when Norway was part of Denmark. Nynorsk (“New Norwegian”) is spoken more in the western and rural parts of Norway, while Bokmål is spoken mainly in urban and eastern areas. A big “hei” to all our Bokmål contributors :wave:

A huge thank you to all the data contributors, language leads and communities for making this possible.

Thank you to the team and volunteers!

I noticed that many (all?) of the new languages this quarter don’t yet appear on the downloads page, for example Kabardian, Sakizaya, … even though they’ve been showing over 15 hours recorded on the “languages” board for several weeks now. Any word on when the data for these languages might be available for download? Thank you!

Hi @cjbaker,

The contributors for these languages have been very active, and we’re in the final stages of reviewing their data for release. The data didn’t quite make the cutoff for the v21 dataset release, but we expect it to be released in June with v22.

If you’d like more information, please don’t hesitate to drop us a line at commonvoice at mozilla dot com.

Hello Everyone,

I’m currently working on a project involving speech recognition and am particularly interested in integrating languages like Arabic, Persian, and Pashto into the DeepSpeech framework.

I know that Mozilla’s Common Voice initiative has made impressive strides in collecting voice datasets for various languages. Could anyone let me know if there are existing datasets for Arabic, Persian, or Pashto available in Common Voice or if there are plans to support them in the future? Any information on these languages, whether it’s pre-existing data or community-led efforts, would be greatly appreciated.

Thanks in advance for your help!

Hi @Santosh_Shetty, first of all, thanks for your interest in Common Voice.

Arabic, Persian and Pashto are all currently available for contribution and as dataset downloads within Common Voice.

If you email us at commonvoice@mozilla.com we can put you in contact with the Language Community leads for these languages.