Common Voice Dataset V.11

ftyers · September 21, 2022, 8:52pm

Hey all!

I’d like to announce the release of Common Voice 11, the eleventh release of the Common Voice dataset. The dataset now contains 24,210 hours, an increase of over 16% compared to the last release!

In this edition we have reached 100 languages, including 4 new languages:

Hill Mari (mrj), Saraiki (skr), Tigrinya (ti), Twi (tw)

There are thirty-three languages with over 100 hours of data collected. The languages that newly passed 100 hours in this release are:

Persian (fa) and Uyghur (ug)

Seven languages have over 45% of their gender tags as female:

Abkhaz (ab), Maltese (mt), Dhivehi (dv), Serbian (sr), Marathi (mr), Meadow Mari (mhr) and Hill Mari (mrj)

You can:

Access the dataset: https://commonvoice.mozilla.org/datasets
Access the metadata: https://github.com/common-voice/cv-dataset

If you have any questions, please feel free to get in contact with us.

Thank you everyone for your work!

Francis M. Tyers
Linguistic advisor for Common Voice

kathyreid · September 22, 2022, 12:44am

I’ve updated the dataset coverage visualisations:

bozden · September 22, 2022, 1:35am

Hey @kathyreid, thanks for the update, these are wonderful visualizations.

About your splits question:

An unknown
One of the things I will need to follow up with Gabe is whether duplicated sentences are removed from the splits.

The original CorporaCreator should be used here… If you are referring to multiple recordings per sentence by saying “duplicated sentences”, no, each sentence lives in one split in this case.

If you are referring to case differences, like “I’m in Paris.” vs “I’m in paris”, I didn’t see any commit related to this, and it shouldn’t be implemented right away IMHO…
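For anyone who wants to check both cases on a downloaded language, here is a minimal sketch. It is not part of CorporaCreator; the function names are made up for illustration, and it assumes the split files are tab-separated with a `sentence` column. It reports which sentences appear in more than one split, either exactly or ignoring case.

```python
import csv
from itertools import combinations


def read_sentences(tsv_path):
    """Return the set of values in the `sentence` column of one split file.

    Assumes a tab-separated file with a header row and no quoting.
    """
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return {row["sentence"] for row in reader}


def shared_sentences(splits, ignore_case=False):
    """Map each pair of split names to the sentences they have in common.

    `splits` maps a split name to an iterable of sentences.
    With ignore_case=True, sentences are compared after str.casefold(),
    so they match even when only the capitalisation differs.
    """
    def normalise(sentence):
        return sentence.casefold() if ignore_case else sentence

    keyed = {
        name: {normalise(sentence) for sentence in sentences}
        for name, sentences in splits.items()
    }
    # Intersect every pair of splits; an empty set means no overlap.
    return {(a, b): keyed[a] & keyed[b] for a, b in combinations(keyed, 2)}
```

A possible way to use it, with the paths being placeholders for wherever the language was extracted: build `splits = {name: read_sentences(f"{name}.tsv") for name in ("train", "dev", "test")}` and pass it to `shared_sentences(splits)` for exact matches, or `shared_sentences(splits, ignore_case=True)` for the case-insensitive check. Note that this only folds case, so sentences that also differ in punctuation are still treated as distinct.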

But… Did you DL the whole set of 100 languages to generate these?

kathyreid · September 22, 2022, 1:57am

Good points, thanks @bozden! When I get a chance I will update the dataviz - I forked it from the v9 one. This data was generated straight from the Corpora Creator-generated JSON at:

bozden · September 22, 2022, 2:02am

Oh, of course. These are aggregated values…

laughleftlaughright · October 4, 2022, 11:54pm

Hey @kathyreid, thank you so much for the wonderful visualisations! Common Voice Cantonese (Yue) is using your charts as reference to introduce the statistics of CV 11.0 on our Instagram page (@/commonvoice.yue). Your work has been a great help to us!
