Common Voice 21 dataset now available

We’re delighted to announce that the Common Voice 21 dataset is now available :tada:

Common Voice now hosts 134 languages, with nearly 33,500 hours of speech from over 350,000 distinct speakers.

In this release, we welcome Norwegian Bokmål, one of the two official languages of Norway, the other being Nynorsk. Like many closely related languages, Nynorsk and Bokmål have different heritages. Bokmål, literally “book language”, is heavily influenced by Danish, dating from the period when Norway was part of Denmark. Nynorsk, “New Norwegian”, is spoken more in the western and rural parts of Norway, while Bokmål is spoken mainly in urban and eastern areas. A big “hei” to all our Bokmål contributors :wave:

A huge thank you to all the data contributors, language leads and communities for making this possible.

Thank you to the team and volunteers!

I noticed that many (all?) of the new languages this quarter don’t yet appear on the downloads page, for example Kabardian, Sakizaya, … even though they’ve been showing over 15 hours recorded on the “languages” board for several weeks now. Any word on when the data for these languages might be available for download? Thank you!

Hi @cjbaker,

The contributors for these languages have been very active, and we’re in the final stages of reviewing their data for release. The data didn’t quite make the cutoff for the v21 dataset release, but we expect it to be released in June with v22.

If you’d like more information, please don’t hesitate to drop us a line at commonvoice at mozilla dot com.