Now that I’ve got your attention, the situation can actually be saved…
Our main goal for v8 (in Turkish) was to balance the dataset while growing the text and voice corpora. In our campaign we therefore emphasized the importance of demographic data.
This is from the v7.0 dataset:
And this is from the new v8.0:
These are record distributions from validated.tsv. As you can see, records without demographic data increased despite our efforts (see the blank rows/columns).
This is already an issue in CorporaCreator, see:
In the v7.0 Turkish dataset we had 26 such persons, affecting 1,327 recordings. In v8.0 this increased to 79 persons and 10,577 recordings!!!
Two possible causes:
- The following bug caused deletion of demographic data. I was on the alert for it, and it happened anyway: https://github.com/common-voice/common-voice/issues/3353
- CV logged people out many times a day after the server upgrades in late October. People might have tired of logging in and kept recording without being logged in.
I calculated that, of the 32.36% shown above for the v8.0 dataset, 16.73 percentage points can be corrected, leaving only 15.63% without actual demographics (these numbers are for gender, but should be similar for age).
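For anyone who wants to run a similar check on their own language, here is a rough sketch. It is not necessarily how the numbers above were produced. It assumes that a blank record "can be corrected" when the same `client_id` has exactly one non-blank value for that field on other records, and it uses the `client_id` and `gender` columns of validated.tsv. The function names `load_rows` and `missing_stats` are made up for this example.

```python
import csv
from collections import defaultdict


def load_rows(tsv_path):
    """Read a Common Voice TSV file into a list of dicts (one per recording)."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE))


def missing_stats(rows, field="gender"):
    """Return (pct_missing, pct_recoverable, pct_unrecoverable) for `field`.

    Assumption: a blank value is recoverable only if the same client_id
    reported exactly one distinct non-blank value on its other records.
    """
    known_by_client = defaultdict(set)
    for r in rows:
        value = (r.get(field) or "").strip()
        if value:
            known_by_client[r["client_id"]].add(value)

    total = len(rows)
    if total == 0:
        return 0.0, 0.0, 0.0

    missing = 0
    recoverable = 0
    for r in rows:
        if (r.get(field) or "").strip():
            continue  # this record already has a value
        missing += 1
        if len(known_by_client.get(r["client_id"], ())) == 1:
            recoverable += 1

    def pct(n):
        return 100.0 * n / total

    return pct(missing), pct(recoverable), pct(missing - recoverable)
```

Calling `missing_stats(load_rows("validated.tsv"), "gender")` gives the three percentages for gender; pass `"age"` for the age column.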
Please be aware of this if you are using demographic info in your own splits/trainings.
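If you build your own splits, one possible workaround is to back-fill blank fields from the same contributor's other records before splitting. The sketch below is only an illustration under the same assumption (one unambiguous value per `client_id`); `backfill` is a hypothetical helper, not part of CorporaCreator. Conflicting values are left blank on purpose, since, for example, a contributor's age band can legitimately change over time.

```python
from collections import defaultdict


def backfill(rows, fields=("gender", "age")):
    """Return copies of `rows` with blank `fields` filled per client_id.

    Assumption: a blank is filled only when the same client_id has exactly
    one distinct non-blank value for that field; otherwise it stays blank.
    """
    known = {f: defaultdict(set) for f in fields}
    for r in rows:
        for f in fields:
            value = (r.get(f) or "").strip()
            if value:
                known[f][r["client_id"]].add(value)

    filled = []
    for r in rows:
        r = dict(r)  # copy, so the input rows are not modified
        for f in fields:
            if (r.get(f) or "").strip():
                continue
            values = known[f].get(r["client_id"], set())
            if len(values) == 1:
                r[f] = next(iter(values))
        filled.append(r)
    return filled
```

The input can be any list of dicts with a `client_id` key, such as the rows of validated.tsv read with `csv.DictReader`.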
To remedy:
- The issue in CorporaCreator should be resolved (its effect is cumulative).
- Fewer automatic logouts
- Easier login system (see: https://github.com/common-voice/common-voice/issues/3393)
- Ask your community to re-check/re-fill their demographic info.