Dataset 17 Release

jesslynnrose · March 19, 2024, 2:09pm

The Common Voice team is so excited to be releasing the 17.0 Common Voice dataset, made possible by our voice and text corpus contributors, language community activists, open source contributors and countless other community members. Thank you all so much for making this possible.

The Common Voice speech corpus is now a dazzling 31,000 hours of speech clips. This is an increase of 847 hours since our last release. This release also adds 493 hours of validated clips to the new dataset!

Clips in Haitian Creole, Nso, Zulu and Zaza join the Common Voice dataset for the first time with this release.

This dataset is inclusive of data collected through March 14th, 2024. Data collected after March 14th will be included in the next dataset release.

Dataset releases are quarterly, and we expect to see 18.0 released in June 2024.

jesslynnrose · March 19, 2024, 2:17pm

jesslynnrose · March 20, 2024, 12:21pm

kathyreid · March 21, 2024, 9:31pm

And we now have the data visualisation of metadata available for the v17 release

This has been updated to reflect the amended gender categories now used.

Some interesting observations (please let me know if you have different interpretations):

Catalan (ca) now has more data in Common Voice than English (en) (!)
The language with the highest average audio utterance duration at nearly 7 seconds is Icelandic (is). This may change if the limits for sentence length and utterance duration are relaxed.
Spanish (es), Bangla (Bengali) (bn), Mandarin Chinese (zh-CN) and Japanese (ja) all have a lot of recorded utterances that have not yet been validated. Albanian (sq) has the highest percentage of validated utterances, followed closely by Erzya / Arisa (myv).
Votic (vot) has the highest percentage of invalidated utterances, but with 76% of utterances invalidated, I wonder if this language has been the target of deliberate invalidation activity (invalidating valid sentences, or recording sentences to be deliberately invalid).

bozden · March 21, 2024, 8:57am

Thank you, team! This is one of the most thrilling versions. Leading up to this version, some major changes happened:

Recording duration is extended to 15s from 10s (this would need other changes in the rule files, for example raising the 14-word limit to 20 or 21)
Gender information is changed/detailed
Text corpora can now have domain information, so a sentence_domain field has been added.
In the metadata, wherever a sentence exists, its sentence_id has been added. This is a cryptographic hash generated in JS; it is not easy to replicate in Python and would be very costly to recompute. With it available as an index, the text corpus analysis scripts can fly.
And most importantly: we now have the full text corpus as part of a release. Until now, we tried to combine it from GitHub files and from metadata (after March 2023). With v17.0 we have all validated and invalidated sentences with some basic statistics, so one can easily get the answer to questions like “How many sentences do I have left unrecorded?”.
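As a rough illustration of the last two points, here is a minimal Python sketch of answering the "left unrecorded" question by using sentence_id as the join key. It is only a sketch under assumptions: the `sentence` column name, the toy rows, and the helper names `read_rows` and `unrecorded_sentences` are made up for illustration, and the real release files may be laid out differently.

```python
import csv
import io


def read_rows(tsv_file):
    # Common Voice metadata is tab-separated; QUOTE_NONE keeps quote
    # characters inside sentences from being treated as CSV quoting.
    return csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)


def unrecorded_sentences(corpus_tsv, clips_tsv):
    """Map sentence_id -> sentence for corpus sentences that have no clip."""
    # Assumption: both files carry a sentence_id column, and the corpus
    # file also has a "sentence" column (hypothetical name).
    corpus = {row["sentence_id"]: row["sentence"] for row in read_rows(corpus_tsv)}
    recorded = {row["sentence_id"] for row in read_rows(clips_tsv)}
    return {sid: text for sid, text in corpus.items() if sid not in recorded}


# Made-up toy data standing in for the real release files.
corpus_tsv = io.StringIO(
    "sentence_id\tsentence\n"
    "a1\tFirst sentence.\n"
    "b2\tSecond sentence.\n"
    "c3\tThird sentence.\n"
)
clips_tsv = io.StringIO(
    "path\tsentence_id\tsentence\n"
    "clip_1.mp3\ta1\tFirst sentence.\n"
    "clip_2.mp3\ta1\tFirst sentence.\n"
)

missing = unrecorded_sentences(corpus_tsv, clips_tsv)
print(len(missing), "sentences left unrecorded")
```

Because the ID is already in the metadata, the comparison is a plain set lookup and nothing has to be re-hashed on the Python side.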

Again, thank you team!

PS: My analysis will take some time due to major changes in metadata.

bozden · March 21, 2024, 9:29am

@kathyreid, this might or might not be… We only have the validated percentage; the rest is a sum of invalidated and waiting. Please see this issue:

I don’t know that particular case, you might be right. But it can also be related to the following:

Some spammers record invalid sentences, or a few people record many clips while their connection is bad, with crackles etc.
The text corpus might be wrong (mixed with other languages), and those sentences get recorded. Once recorded, there is no way to both REPORT and INVALIDATE, so it is better to invalidate. See this…
Minority languages can be under pressure, as emphasized here.

kathyreid · March 21, 2024, 9:35pm

@ftyers had a look at the Votic (vot) recordings - I understand this is a new language to Common Voice - and confirmed that the majority of invalidated clips were:

silence
slurred speech, possibly aphasia
and a crackly microphone

(huge thank you Fran for digging in to that).

You’re right, @bozden, my analysis doesn’t look at the domain of the sentence clips. I should fix that up.

bozden · March 21, 2024, 10:52pm

Great job to both of you, you figured it out. I also invalidated many clips lately from some teenagers, possibly having a party. Many new languages and/or users can have these, when the community is not properly directed / educated.

We should have some short videos explaining what is what; younger people especially do not read and prefer YouTube videos. We created similar ones for our community, but they became outdated with the changes in the webapp. I need to renew them…

Maybe it is a good idea that the team produces some official ones, with the possibility of adding subtitles by communities for localization / maybe they can create their own localized ones looking at these.

bozden · March 22, 2024, 10:27pm

With the new version of the Metadata Viewer, I can see the following:

Although a pretty new feature, people started to use the domain information:

Some of the languages have very high values under unvalidated sentences. With small corpora, this can be expected, but for some like Arabic, Persian or Thai, the values are very high. Again, here, we cannot distinguish between invalidated and not yet reviewed, though.
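To make the limitation concrete, a small Python sketch with made-up counts (not taken from the dataset): when only the validated share is known, two very different situations look identical. The function name `share_breakdown` and all numbers are assumptions for illustration only.

```python
def share_breakdown(total, validated):
    """Return (validated %, remainder %) from two counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    validated_pct = 100.0 * validated / total
    # The remainder lumps together invalidated and not-yet-reviewed items.
    return validated_pct, 100.0 - validated_pct


# Two hypothetical languages with the same totals but different states.
mostly_invalidated = {"total": 1000, "validated": 400, "invalidated": 550, "pending": 50}
mostly_pending = {"total": 1000, "validated": 400, "invalidated": 50, "pending": 550}

view_a = share_breakdown(mostly_invalidated["total"], mostly_invalidated["validated"])
view_b = share_breakdown(mostly_pending["total"], mostly_pending["validated"])

# Both print the same pair, although the underlying situations differ.
print(view_a, view_b)
```

Telling the two cases apart would need the invalidated count (or the pending count) as a separate value.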

On the global values, the validated Hours percentage keeps dropping… The recordings are there, but lead communities should validate them. I think we need a global event for recording & validation, as voiced in previous months.