Great points @kathyreid. Just to add a bit to your valuable insights:
Validation keeps falling behind globally, as you can see in the graph above. Some languages keep up, but others are left as they are. You can see validation jumping and keeping up during v7.0 and v8.0, but afterward its growth is linear while the recording curve is steeper, so the gap keeps accumulating.
I see two reasons for that:
- Some communities form around a specific project, and after they reach some target (e.g. 100h validated) they seem to leave the dataset/project. The Uzbek language seems to be one of them.
- At the end of 2021, towards v8.0, Common Voice ran a global social media campaign, which resulted in a nice jump and also emphasized the importance of the validation process. During that period the project team had stronger relations with the language communities and was helping them. Unfortunately, this is not the case after 2022.
I think a similar global campaign would provide a solution for the validation backlog and also for the whole project. Building more knowledgeable core groups that do the validation continuously will solve the problem.
Beware: “Validated%” will never reach 100% if there are invalidated recordings (which usually make up 2-10% of a dataset, depending on the language).
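To make that ceiling concrete, here is a tiny sketch. It assumes “Validated%” means validated recordings divided by all recordings (validated + invalidated + still pending); the function name and the numbers are made up for illustration only.

```python
def validated_percent(validated, invalidated, pending=0):
    """Share of all recordings that ended up validated, in percent.

    Assumes "Validated%" = validated / (validated + invalidated + pending).
    """
    total = validated + invalidated + pending
    return 100.0 * validated / total if total else 0.0


# Made-up numbers: 1,000 recordings, 5% invalidated, nothing left in the queue.
print(validated_percent(validated=950, invalidated=50))  # 95.0
```

So even with an empty validation queue, a language with 5% invalidated recordings would top out at 95% under this definition.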
This is usual for new languages, but lack of diversity is definitely a problem with older ones with larger datasets.
Again, working more closely with these communities and running global events will help a lot.
Such an analysis of the complete text corpora became impossible after the sentences went into the database. Without a periodic export of these sentences, we cannot analyze sentence diversity, vocabulary coverage, how much of the text corpus has been recorded, etc. See this issue.
I do have them analyzed up to March 2023 in the Dataset Analyzer (Text-Corpus tab), and it also shows simple, up-to-date counts of distinct sentences for each split (Sentences tab). Perhaps, at least temporarily, I can add a text-corpora analysis based on validated.tsv, which would at least show the current status of the validated set.
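A rough sketch of what such a validated.tsv based analysis could look like, not what the Dataset Analyzer actually does. It assumes the file is tab-separated with a `sentence` column, and it uses a naive lowercase whitespace split as a stand-in for real tokenization; the function name `summarize_sentences` and the sample rows are made up.

```python
import csv
import io
from collections import Counter


def summarize_sentences(tsv_file):
    """Summarize the `sentence` column of a Common Voice style TSV file object."""
    # QUOTE_NONE: treat quote characters inside sentences as ordinary text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    sentence_counts = Counter()
    vocabulary = Counter()
    for row in reader:
        sentence = (row.get("sentence") or "").strip()
        if not sentence:
            continue  # skip rows without sentence text
        sentence_counts[sentence] += 1
        # Naive tokenization (assumption): lowercase, split on whitespace.
        vocabulary.update(sentence.lower().split())
    recordings = sum(sentence_counts.values())
    distinct = len(sentence_counts)
    return {
        "recordings": recordings,
        "distinct_sentences": distinct,
        "recordings_per_sentence": recordings / distinct if distinct else 0.0,
        "vocabulary_size": len(vocabulary),
    }


# Tiny in-memory sample standing in for validated.tsv (made-up rows).
sample = io.StringIO(
    "client_id\tpath\tsentence\n"
    "a\t1.mp3\tHello world\n"
    "b\t2.mp3\tHello world\n"
    "c\t3.mp3\tGood morning world\n"
)
print(summarize_sentences(sample))
```

For a real release you would pass an open file instead, e.g. `open("validated.tsv", encoding="utf-8", newline="")`. Note that this only sees sentences that already have validated recordings, so it cannot tell how much of the full text corpus remains unrecorded.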
Any idea on this would be very valuable @kathyreid…