Dataset 17 Release

The data visualisation of the metadata is now available for the v17 release :tada:

This has been updated to reflect the amended gender categories now in use.

Some interesting observations (please let me know if you have different interpretations):

  • Catalan (ca) now has more data in Common Voice than English (en) (!)

  • The language with the highest average audio utterance duration at nearly 7 seconds is Icelandic (is). This may change if the limits for sentence length and utterance duration are relaxed.

  • Spanish (es), Bangla (Bengali) (bn), Mandarin Chinese (zh-CN) and Japanese (ja) all have a lot of recorded utterances that have not yet been validated. Albanian (sq) has the highest percentage of validated utterances, followed closely by Erzya / Arisa (myv).

  • Votic (vot) has the highest percentage of invalidated utterances. At 76% invalidated, I wonder whether this language has been the target of deliberate invalidation activity (either voting down valid recordings, or recording sentences in a deliberately invalid way).
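
If anyone wants to check percentages like these for themselves, here is a rough sketch of the arithmetic. It assumes per-locale clip counts split into `validated`, `invalidated` and `other` buckets. The dictionary layout, the function names and the locale codes `xx`, `yy`, `zz` are all made up for illustration, and the counts are placeholders rather than real Common Voice figures.

```python
# Placeholder counts for hypothetical locales; NOT real Common Voice figures.
# The bucket names (validated / invalidated / other) are an assumed layout.
sample_stats = {
    "xx": {"validated": 900, "invalidated": 50, "other": 50},
    "yy": {"validated": 120, "invalidated": 380, "other": 0},
    "zz": {"validated": 200, "invalidated": 100, "other": 700},
}


def bucket_shares(buckets):
    """Return each bucket's share of the locale's total clips, in percent."""
    total = sum(buckets.values())
    if total == 0:
        # Avoid dividing by zero for a locale with no clips at all.
        return {name: 0.0 for name in buckets}
    return {name: 100.0 * count / total for name, count in buckets.items()}


def rank_by_share(stats, bucket):
    """List (locale, percent) pairs for one bucket, highest percentage first."""
    shares = {locale: bucket_shares(b)[bucket] for locale, b in stats.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


most_invalidated = rank_by_share(sample_stats, "invalidated")
most_validated = rank_by_share(sample_stats, "validated")
least_reviewed = rank_by_share(sample_stats, "other")

for locale, pct in most_invalidated:
    print(f"{locale}: {pct:.1f}% invalidated")
```

Ranking on the `other` bucket in the same way would surface languages with many recordings still waiting for validation. A ranking by percentage alone says nothing about why a share is high, so a figure like the Votic one still needs a look at the underlying clips.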
