Common Voice 18 is here

We’re so excited to be releasing this quarter’s freshest multilingual speech data. This release brings us up to 31,000 hours of speech in 128 languages, all available for download here :sparkler:

I’m especially excited to welcome Xhosa, Kalenjin, Kidaw’ida, Dholuo and Setswana to Common Voice 18.

More information on our blog, and as always: feedback is so warmly welcomed!


As soon as the JSON file is available, I will do the metadata coverage visualisation.


Yes, the missing cv-dataset repo update and v18.0 Georgian dataset also prevent me from going further. Some feedback and an estimate of when they will be available would be very much appreciated.

The data visualisation of metadata coverage for all languages in the CV dataset for v18 is now available.

Some of my observations are:

  • Catalan (ca) now has a larger dataset than English, based on the number of audio recordings (including validated and yet-to-be-validated recordings). It’s also an interesting dataset because the number of recordings per unique contributor is relatively low (around 80). This means it’s likely to have a high diversity of speakers in the dataset, which is useful for building ASR models that generalise well to many speakers. Catalan also appears to have the highest percentage of audio recordings by older speakers - e.g. speakers in their forties, fifties and older. Again, this highlights the diversity of speakers in the Catalan dataset.

  • Although it’s very early to see any trends from Common Voice’s decision to expand the range of options for gender identity, we are starting to see some data tagged with the new options. For example, in Uyghur (ug), we now have data tagged as “do not wish to say”. Without more evidence, I don’t want to draw connections between the geopolitical situation in that area and contributors’ reluctance to provide demographic data that might identify them, but I think it’s telling that the first use of these expanded metadata categories appears in a language spoken in a contested geography.

  • Similarly, it’s very early to identify trends in sentence domain classification, as most of the sentences that do have a domain tag are labelled “general”, although “health_care” sentences occur frequently in languages such as Albanian (sq).

  • Bangla (Bengali) (bn) continues to have a very large number of yet-to-be-validated audio recordings. Due to this, the train split for Bangla is quite small.

  • Dholuo (luo), a language spoken in Kenya and Tanzania, is an outlier in terms of the number of distinct data contributors to the dataset - this language has a very high average number of contributions per contributor. This is often seen in languages that are new to Common Voice, before they have been able to recruit more contributors. Dholuo has nearly 5 million speakers.

  • The language with the highest average utterance duration is, by far, Icelandic (is), at over 7 seconds. This may be because Icelandic has many words with several syllables, which take longer to pronounce. Compare “the cat sat on the mat” in English with “kötturinn sat á mottunni” in Icelandic.
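
For anyone who wants to reproduce the "recordings per unique contributor" figure mentioned above, here is a minimal sketch. It assumes each recording is available as a dict with a `client_id` key; that layout, the function name and the toy data are illustrative assumptions, not the actual pipeline behind the visualisation.

```python
from collections import Counter


def recordings_per_contributor(rows):
    """Return the mean number of recordings per unique contributor.

    `rows` is an iterable of dicts with a `client_id` key
    (an assumed layout, used here only for illustration).
    """
    counts = Counter(row["client_id"] for row in rows)
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)


# Toy data, not real Common Voice figures.
sample = [
    {"client_id": "a"}, {"client_id": "a"}, {"client_id": "a"},
    {"client_id": "b"},
]
print(recordings_per_contributor(sample))  # 2.0
```

A low value suggests many different speakers, while a high value (as with a newly added language) suggests a few very active contributors.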


I cannot say what is too much, AFAIK it really depends on the model and workflow, but Catalan has some people recording many sentences, so the variance is high. In any case, one person recording >131k sentences is huge…

Also, Icelandic has only 4 speakers with 40 recordings total, so I’d give the flag to Estonian (et) with 6.930 sec/rec average :slight_smile: I don’t think anything below 1000 recordings can be statistically meaningful here.
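
To make the threshold idea concrete, here is a rough sketch that averages clip durations per language and drops languages with too few recordings. The input shape (pairs of locale and duration in seconds), the function name and the toy numbers are assumptions for illustration; the cut-off of 1000 is only the rule of thumb suggested above.

```python
def mean_duration_by_locale(clips, min_recordings=1000):
    """Average duration per locale, skipping locales with few recordings.

    `clips` is an iterable of (locale, duration_seconds) pairs
    (a hypothetical input shape, chosen for this example).
    """
    totals = {}
    for locale, seconds in clips:
        count, total = totals.get(locale, (0, 0.0))
        totals[locale] = (count + 1, total + seconds)
    return {
        locale: total / count
        for locale, (count, total) in totals.items()
        if count >= min_recordings
    }


# Toy data with a small threshold so the effect is visible.
toy = [("is", 7.5), ("is", 7.1), ("et", 7.0), ("et", 6.5), ("et", 7.5)]
result = mean_duration_by_locale(toy, min_recordings=3)
print(result)  # {'et': 7.0}
```

With the threshold applied, the language with only two clips no longer appears in the ranking at all.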


Excellent points, as always, @bozden!


@kathyreid, good point. “General” is equivalent to no domain, so to get meaningful statistics we need to exclude those sentences. Because of the bulk sentence additions marked as “general”, that tag will always be the most abundant.
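
A minimal sketch of that exclusion, assuming each sentence carries at most one domain tag as a string (or nothing when untagged). The function name, the input shape and the toy tags are assumptions for illustration, not the code behind the statistics.

```python
def domain_coverage(domains, ignore=("general",)):
    """Count sentences that carry a real domain tag.

    `domains` is an iterable with one tag per sentence: a string,
    or None / "" when the sentence is untagged (assumed layout).
    Tags listed in `ignore` are treated the same as no tag.
    """
    total = 0
    with_domain = 0
    for domain in domains:
        total += 1
        if domain and domain not in ignore:
            with_domain += 1
    share = with_domain / total if total else 0.0
    return with_domain, total, share


# Toy data, not real Common Voice figures.
toy_tags = ["general", "health_care", None, "", "general", "health_care"]
with_domain, total, share = domain_coverage(toy_tags)
print(with_domain, total)  # 2 6
```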

Here is what I get (“with domain” column does not include “general”):

BTW, it seems some information has been lost between v17.0 and v18.0, where the naming of the options has changed. During a teaching session we added some domain-specific sentences, and they are not reflected in the statistics. I map them between the changes, so I must check it…
