Common Voice Dataset V.11

Hey all!

I’d like to announce the release of Common Voice 11, the eleventh release of the Common Voice dataset. The dataset now contains 24,210 hours, an increase of over 16% compared to the last release! :rocket:

In this edition we have reached 100 languages, including 4 new languages:

  • Hill Mari (mrj), Saraiki (skr), Tigrinya (ti), Twi (tw)

Thirty-three languages now have over 100 hours of data collected. The two that newly passed 100 hours in this release are:

  • Persian (fa) and Uyghur (ug)

Seven languages have over 45% of their gender tags as female:

  • Abkhaz (ab), Maltese (mt), Dhivehi (dv), Serbian (sr), Marathi (mr), Meadow Mari (mhr) and Hill Mari (mrj)

You can:

If you have any questions, please feel free to get in contact with us :slight_smile:

Thank you everyone for your work!

Francis M. Tyers
Linguistic advisor for Common Voice


I’ve updated the dataset coverage visualisations:


Hey @kathyreid, thanks for the update, these are wonderful visualizations.

About your splits question:

An unknown
One of the things I will need to follow up with Gabe is whether duplicated sentences are removed from the splits.

The original CorporaCreator should be used here… If by “duplicated sentences” you mean multiple recordings of the same sentence, then no: in this case each sentence lives in only one split.

If you are referring to case differences, like “I’m in Paris.” vs “I’m in paris”, I didn’t see any commit related to this, and IMHO it shouldn’t be implemented right away…

But… Did you DL the whole set of 100 languages to generate these?
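For anyone who wants to check the split question on a language they have downloaded, here is a minimal sketch. It assumes each split is a tab-separated file with a `sentence` column; the function names `load_sentences` and `split_overlaps` and the `normalize` flag are made up for illustration and are not part of CorporaCreator. With `normalize=True` it also ignores case and trailing punctuation, so it would catch pairs like the “Paris” example above.

```python
import csv
import io
from itertools import combinations


def load_sentences(tsv_file, normalize=False):
    """Return the set of sentences found in a TSV file object.

    Assumes a header row containing a 'sentence' column (an assumption
    about the split files, not something confirmed in this thread).
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    sentences = set()
    for row in reader:
        text = row["sentence"].strip()
        if normalize:
            # Hypothetical normalization: ignore case and trailing punctuation.
            text = text.casefold().rstrip(".!?")
        sentences.add(text)
    return sentences


def split_overlaps(splits):
    """Map each pair of split names to the set of sentences they share."""
    return {
        (a, b): splits[a] & splits[b]
        for a, b in combinations(sorted(splits), 2)
    }


# Tiny in-memory stand-ins for two split files (illustrative data only).
TRAIN_TSV = "path\tsentence\na.mp3\tI'm in Paris.\n"
DEV_TSV = "path\tsentence\nb.mp3\tI'm in paris\n"

exact = split_overlaps({
    "train": load_sentences(io.StringIO(TRAIN_TSV)),
    "dev": load_sentences(io.StringIO(DEV_TSV)),
})
loose = split_overlaps({
    "train": load_sentences(io.StringIO(TRAIN_TSV), normalize=True),
    "dev": load_sentences(io.StringIO(DEV_TSV), normalize=True),
})
print(exact)
print(loose)
```

For real files, replace the `io.StringIO(...)` objects with something like `open("train.tsv", encoding="utf-8", newline="")`. An empty set for every pair means no sentence is shared between splits under the chosen comparison.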


Good points, thanks @bozden! When I get a chance I will update the dataviz - I forked it from the v9 one. This data was generated straight from the Corpora Creator-generated JSON at:


Oh, of course :slight_smile: These are aggregated values…
