And we now have the data visualisation of metadata available for the v17 release.
This has been updated to reflect the amended gender categories now used.
Some interesting observations (please let me know if you have different interpretations):
- Catalan (ca) now has more data in Common Voice than English (en) (!)
- The language with the highest average audio utterance duration, at nearly 7 seconds, is Icelandic (is). This may change if the limits for sentence length and utterance duration are relaxed.
- Spanish (es), Bangla (Bengali) (bn), Mandarin Chinese (zh-CN) and Japanese (ja) all have a lot of recorded utterances that have not yet been validated. Albanian (sq) has the highest percentage of validated utterances, followed closely by Erzya / Arisa (myv).
- Votic (vot) has the highest percentage of invalidated utterances, but with 76% of utterances invalidated, I wonder if this language has been the target of deliberate invalidation activity (invalidating valid sentences, or recording sentences to be deliberately invalid).
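
For anyone who wants to check percentages like these against their own copy of the metadata, here is a rough sketch of the arithmetic. It assumes you have already pulled out three counts per locale (total clips, validated clips, invalidated clips) and treats whatever is left over as "not yet validated". The function name, the dictionary layout and the locale codes `xx` and `yy` are made up for illustration, and the numbers are placeholders, not real Common Voice figures.

```python
from typing import Dict


def bucket_percentages(clips: int, validated: int, invalidated: int) -> Dict[str, float]:
    """Return the percentage of clips that are validated, invalidated, or not yet validated."""
    if clips <= 0:
        raise ValueError("clips must be positive")
    # Assumption: every clip is either validated, invalidated, or still awaiting votes.
    unvalidated = clips - validated - invalidated
    return {
        "validated": 100.0 * validated / clips,
        "invalidated": 100.0 * invalidated / clips,
        "unvalidated": 100.0 * unvalidated / clips,
    }


# Placeholder counts for two hypothetical locales; NOT real Common Voice data.
example = {
    "xx": {"clips": 1000, "validated": 200, "invalidated": 760},
    "yy": {"clips": 500, "validated": 450, "invalidated": 10},
}

# Percentages per locale, then the locale with the largest invalidated share.
shares = {locale: bucket_percentages(**counts) for locale, counts in example.items()}
most_invalidated = max(shares, key=lambda locale: shares[locale]["invalidated"])

for locale, pct in shares.items():
    print(locale, {name: round(value, 1) for name, value in pct.items()})
print("highest invalidated share:", most_invalidated)
```

Ranking by percentage rather than by raw count is what makes a small language stand out in a comparison like this, so it is worth looking at the absolute clip counts alongside the percentages before drawing conclusions.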