Major "loss" in demographic data

Now that I’ve got your attention: the situation can actually be saved…

Our main goal for v8 (in Turkish) was to balance the dataset while growing the text and voice corpora. Therefore, in our campaign, we emphasized the importance of demographic data.

This is from v7.0 dataset:

And this is from the new v8.0:

These are record distributions from validated.tsv. As you can see, the share of records without demographic data increased despite our efforts (see the blank rows/columns).
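For anyone who wants to check their own locale, here is a minimal sketch of how the share of blank demographic fields could be counted. It assumes a tab-separated file with `age` and `gender` columns, as in validated.tsv; the function name `blank_share` and the tiny sample are made up for illustration, and this is not necessarily how the charts above were produced.

```python
import csv
import io
from collections import Counter

def blank_share(tsv_file, column):
    """Return the fraction of rows whose `column` is empty."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        value = (row.get(column) or "").strip()
        counts["filled" if value else "blank"] += 1
    total = counts["blank"] + counts["filled"]
    return counts["blank"] / total if total else 0.0

# Made-up sample standing in for validated.tsv (hypothetical rows).
sample = (
    "client_id\tpath\tage\tgender\n"
    "a\t1.mp3\ttwenties\tmale\n"
    "a\t2.mp3\t\t\n"
    "b\t3.mp3\t\t\n"
    "c\t4.mp3\tthirties\tfemale\n"
)
print(blank_share(io.StringIO(sample), "gender"))  # 0.5
```

With a real file, pass `open("validated.tsv", encoding="utf-8", newline="")` instead of the `io.StringIO` sample.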

This is already an issue in CorporaCreator, see:

In the v7.0 Turkish dataset we had 26 such persons, affecting 1,327 recordings. In v8.0 this increased to 79 persons and 10,577 recordings!

Two possible causes:

  1. The following bug caused deletion of demographic data. I was on the alert for it, and it happened anyway:

  2. CV kicked people out many times a day after the server upgrades in late October. People might have grown tired of logging in and kept recording without logging in.

I calculated that, of the 32.36% shown above for the v8.0 dataset, 16.73 percentage points can be corrected, leaving only 15.63% without actual demographics (these numbers are for gender, but age should be similar).
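One way such an estimate might be made is sketched below. The rule is an assumption, not a description of the exact analysis: a blank row counts as correctable when the same `client_id` has exactly one distinct non-blank value in other recordings. The function name `correctable_blanks` and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

def correctable_blanks(tsv_file, column="gender"):
    """Return (blank_rows, correctable_rows, total_rows) for `column`.

    Assumption: a blank row is correctable when the same client_id has
    exactly one distinct non-blank value elsewhere in the file.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    values = defaultdict(set)   # client_id -> non-blank values seen
    blanks = defaultdict(int)   # client_id -> number of blank rows
    total = 0
    for row in reader:
        total += 1
        value = (row.get(column) or "").strip()
        if value:
            values[row["client_id"]].add(value)
        else:
            blanks[row["client_id"]] += 1
    blank_rows = sum(blanks.values())
    correctable = sum(
        n for cid, n in blanks.items() if len(values.get(cid, ())) == 1
    )
    return blank_rows, correctable, total

# Made-up sample standing in for validated.tsv (hypothetical rows).
sample = (
    "client_id\tpath\tage\tgender\n"
    "a\t1.mp3\ttwenties\tmale\n"
    "a\t2.mp3\t\t\n"
    "b\t3.mp3\t\t\n"
    "c\t4.mp3\tthirties\tfemale\n"
)
blank, fixable, total = correctable_blanks(io.StringIO(sample))
print(f"blank: {blank / total:.2%}, correctable: {fixable / total:.2%}, "
      f"remaining: {(blank - fixable) / total:.2%}")
```

In the sample, speaker `a` has one labelled and one blank recording, so that blank can be backfilled, while speaker `b` never supplied a value. The same subtraction gives the figures above: 32.36% - 16.73% = 15.63%.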

Please be aware of this if you are using demographic info in your own splits/trainings.

To remedy:

  1. The issue in CorporaCreator should be resolved (its effect is cumulative).
  2. Fewer automatic logouts
  3. Easier login system (see:
  4. Warn your community to re-check/re-fill their demographics info.

Can you elaborate on this? How did you correct it?

Regarding the remedies, our team is having similar problems! I relayed the message to Hillary @heyhillary. I have been told that around September or earlier, a person could fill in the demographic data and contribute without having to log in (as a guest).

I did not correct anything; I calculated how many records can be corrected by analyzing the data.

It should be corrected globally in datasets as indicated here:
