Now that I’ve got your attention: the situation can actually be saved…
Our main goal for v8 (in Turkish) was to balance the dataset while growing the text and voice corpora. Therefore, in our campaign, we emphasized the importance of demographic data.
These are record distributions from validated.tsv. As you can see, the number of records without demographic data increased despite our efforts (see the blank rows/columns).
This is already an issue in CorporaCreator, see:
In the v7.0 Turkish dataset we had 26 such persons affecting 1,327 recordings; in v8.0 this increased to 79 persons and 10,577 recordings!
CV logged people out many times a day after the server upgrades in late October. People may have grown tired of logging in and kept recording without being logged in.
I calculated that 16.73 percentage points of the above 32.36% in the v8.0 dataset can be corrected, leaving only 15.63% without actual demographics (these numbers are for gender, but should be similar for age).
Please be aware of this if you are using demographic info in your own splits/trainings.
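For anyone who wants to check their own locale, here is a minimal sketch of how such figures could be estimated. It assumes that a blank `gender` is "correctable" when the same `client_id` has exactly one non-blank `gender` value on other records. That definition is an assumption made for illustration, not necessarily the method behind the numbers above. The function names `load_rows` and `gender_gap_report` are hypothetical; `client_id` and `gender` are columns of validated.tsv.

```python
import csv
from collections import defaultdict


def load_rows(path):
    """Read a Common Voice TSV (e.g. validated.tsv) into a list of dicts."""
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE: sentences may contain quote characters that are not CSV quoting.
        return list(csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE))


def gender_gap_report(rows):
    """Percentages (of all records) with blank gender, split into
    recoverable and remaining, under the assumption described above."""
    rows = list(rows)

    # Collect every non-blank gender value seen for each client.
    known = defaultdict(set)
    for r in rows:
        gender = (r.get("gender") or "").strip()
        if gender:
            known[r["client_id"]].add(gender)

    blank = recoverable = 0
    for r in rows:
        if not (r.get("gender") or "").strip():
            blank += 1
            # Assumption: recoverable only if the client has exactly one
            # known value; conflicting values are left alone.
            if len(known.get(r["client_id"], ())) == 1:
                recoverable += 1

    total = len(rows)

    def pct(n):
        return round(100.0 * n / total, 2) if total else 0.0

    return {
        "blank_pct": pct(blank),
        "recoverable_pct": pct(recoverable),
        "remaining_pct": pct(blank - recoverable),
    }
```

Usage would be something like `gender_gap_report(load_rows("validated.tsv"))`. The same idea should carry over to `age` by swapping the column name.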
To remedy:
The issue in CorporaCreator should be resolved (its effect is cumulative).
Can you elaborate on this? How did you correct it?
Regarding the remedies, our team is having similar problems! I relayed the message to Hillary (@heyhillary). I have been told that, around September or earlier, a person can fill in the demographic data and contribute without having to log in (as a guest).