Hey all!
I’d like to announce the release of Common Voice 11, the eleventh release of the Common Voice dataset. The dataset now contains 24,210 hours, an increase of over 16% compared to the last release!
In this edition we have reached 100 languages, including 4 new languages:
- Hill Mari (mrj), Saraiki (skr), Tigrinya (ti), Twi (tw)
There are thirty-three languages with over 100 hours of data collected. The languages that newly passed 100 hours in this release are:
- Persian (fa) and Uyghur (ug)
Seven languages have over 45% of their gender tags as female:
- Abkhaz (ab), Maltese (mt), Dhivehi (dv), Serbian (sr), Marathi (mr), Meadow Mari (mhr) and Hill Mari (mrj)
You can:
- Access the dataset: https://commonvoice.mozilla.org/datasets
- Access the metadata: https://github.com/common-voice/cv-dataset
If you have any questions, please feel free to get in contact with us.
Thank you, everyone, for your work!
Francis M. Tyers
Linguistic advisor for Common Voice