Dataset 17 Release

kathyreid · March 21, 2024, 12:48am

And we now have the data visualisation of metadata available for the v17 release

This has been updated to reflect the amended gender categories now used.

Some interesting observations (please let me know if you have different interpretations):

Catalan (ca) now has more data in Common Voice than English (en) (!)
The language with the highest average audio utterance duration, at nearly 7 seconds, is Icelandic (is). This may change if the limits for sentence length and utterance duration are relaxed.
Spanish (es), Bangla (Bengali) (bn), Mandarin Chinese (zh-CN) and Japanese (ja) all have a lot of recorded utterances that have not yet been validated. Albanian (sq) has the highest percentage of validated utterances, followed closely by Erzya / Arisa (myv).
Votic (vot) has the highest percentage of invalidated utterances. With 76% of its utterances invalidated, I wonder whether this language has been the target of deliberate invalidation activity (invalidating valid sentences, or recording sentences to be deliberately invalid).
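
For anyone who wants to check percentages like these against their own copy of the metadata, here is a minimal sketch of the arithmetic. It assumes each language's clips fall into three buckets (validated, invalidated, and "other" for clips not yet validated). The `LocaleStats` class, the locale codes `xx` and `yy`, and the counts are all made up for illustration and are not taken from the v17 release.

```python
from dataclasses import dataclass


@dataclass
class LocaleStats:
    """Hypothetical per-language clip counts (not real v17 figures)."""
    code: str
    validated: int
    invalidated: int
    other: int  # recorded but not yet validated

    @property
    def total(self) -> int:
        return self.validated + self.invalidated + self.other

    def share(self, count: int) -> float:
        """Return count as a percentage of all clips for this language."""
        return 100.0 * count / self.total if self.total else 0.0


def rank_by_invalidated(stats):
    """Sort languages from highest to lowest share of invalidated clips."""
    return sorted(stats, key=lambda s: s.share(s.invalidated), reverse=True)


# Placeholder data: "xx" and "yy" are invented locale codes.
sample = [
    LocaleStats("xx", validated=800, invalidated=100, other=100),
    LocaleStats("yy", validated=200, invalidated=760, other=40),
]
ranked = rank_by_invalidated(sample)
for s in ranked:
    print(f"{s.code}: {s.share(s.invalidated):.1f}% invalidated, "
          f"{s.share(s.other):.1f}% awaiting validation")
```

Swapping `s.invalidated` for `s.validated` or `s.other` in the sort key gives the other two rankings mentioned above.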