Major "loss" in demographic data

bozden · January 28, 2022, 4:44am

Now that I’ve got your attention: the situation can actually be saved…

Our main goal for v8 (Turkish) was to balance the dataset while growing the text and voice corpora. So, in our campaign, we emphasized the importance of demographic data.

This is from the v7.0 dataset:

And this is from the new v8.0:

These are record distributions from validated.tsv. As you can see, the number of records without demographic data increased despite our efforts (see the blank rows/columns).

This is already a reported issue in CorporaCreator; see:

github.com/common-voice/CorporaCreator

Demographic data is sometimes doubled per client ID

opened 02:36PM - 14 Oct 21 UTC

ftyers

In some cases a given `client_id` might have more than one demographic datapoint… (e.g. gender or age) linked to it. Often this is `blank` vs. `male`/`female` or `blank` vs. some age. This is probably because people recorded some clips then made a profile, or because they became logged out. In any case it would be good (and probably safe) to replace `blank` in the field with the more specific datapoint if and only if there are no other datapoints associated with the `client_id`. Some examples from Turkish, with thanks to @harikalarkutusu! ![image](https://user-images.githubusercontent.com/449545/137339341-a30595db-be51-45f1-b467-cc04245cfa0e.png)

In the v7.0 Turkish dataset we had 26 such persons affecting 1,327 recordings; in v8.0 this increased to 79 persons and 10,577 recordings!
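For anyone who wants to try the rule from that issue on their own copy of the data, here is a minimal sketch, not CorporaCreator's actual code. It assumes rows are dicts with `client_id` and `gender` keys, as in validated.tsv, and that a missing value is an empty string; the function name `fill_blank_demographics` is made up for illustration.

```python
from collections import defaultdict

BLANK = ""  # assumption: a missing demographic value is an empty string


def fill_blank_demographics(rows, field="gender"):
    """Return copies of rows with blank `field` filled when unambiguous."""
    # Collect the distinct non-blank values seen for each client_id.
    seen = defaultdict(set)
    for row in rows:
        if row[field] != BLANK:
            seen[row["client_id"]].add(row[field])

    fixed = []
    for row in rows:
        new_row = dict(row)
        values = seen[row["client_id"]]
        # Fill only if this client has exactly one non-blank value;
        # conflicting values (e.g. male and female) are left untouched.
        if row[field] == BLANK and len(values) == 1:
            new_row[field] = next(iter(values))
        fixed.append(new_row)
    return fixed


rows = [
    {"client_id": "a", "gender": ""},        # blank + one value: fillable
    {"client_id": "a", "gender": "female"},
    {"client_id": "b", "gender": "male"},    # two different values: ambiguous
    {"client_id": "b", "gender": "female"},
    {"client_id": "b", "gender": ""},
    {"client_id": "c", "gender": ""},        # never gave a value: stays blank
]
fixed = fill_blank_demographics(rows)
```

Passing `field="age"` applies the same rule to the age column.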

Two possible causes:

The following bug deleted demographic data. I had been watching out for it, and it did happen:
[Bug] Changing leaderboard visiblity result in removed age and gender in profile settings (smartphone) · Issue #3353 · common-voice/common-voice · GitHub
After the server upgrades in late October, CV logged people out many times a day. People may have grown tired of logging in and kept recording without logging in.

I calculated that, of the 32.36% of v8.0 records shown above without gender, 16.73 percentage points can be corrected, leaving only 15.63% without actual demographics (these numbers are for gender, but should be similar for age).
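A rough sketch of how such shares can be computed, in case you want to check your own language. It assumes rows with `client_id` and `gender` keys and an empty string for a missing value; `blank_shares` and the sample data are hypothetical, not the script used for the numbers above. A blank record counts as correctable when its `client_id` has exactly one non-blank value elsewhere.

```python
def blank_shares(rows, field="gender"):
    """Return (blank, correctable, remaining) as percentages of all rows."""
    # Distinct non-blank values recorded for each client_id.
    values_by_client = {}
    for row in rows:
        if row[field]:
            values_by_client.setdefault(row["client_id"], set()).add(row[field])

    total = len(rows)
    blank = 0
    correctable = 0
    for row in rows:
        if not row[field]:
            blank += 1
            # Correctable only if the client has exactly one known value.
            if len(values_by_client.get(row["client_id"], ())) == 1:
                correctable += 1

    def pct(n):
        return 100.0 * n / total

    return pct(blank), pct(correctable), pct(blank - correctable)


# Made-up sample: 10 records from 3 clients.
sample = (
    [{"client_id": "a", "gender": ""}] * 2
    + [{"client_id": "a", "gender": "female"}] * 2
    + [{"client_id": "b", "gender": ""}] * 3
    + [{"client_id": "c", "gender": "male"}] * 3
)
blank_pct, correctable_pct, remaining_pct = blank_shares(sample)
print(blank_pct, correctable_pct, remaining_pct)  # 50.0 20.0 30.0
```

On real data the rows could come from `csv.DictReader(f, delimiter="\t")` over validated.tsv. The remaining share is simply the blank share minus the correctable share, which is how 32.36 − 16.73 gives 15.63.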

Please be aware of this if you are using demographic info in your own splits/trainings.

To remedy:

The issue in CorporaCreator should be resolved (its effect is cumulative).
Fewer automatic logouts.
An easier login system (see: [req] Passwordless Login And Two-Factor Authentication for CV website · Issue #3393 · common-voice/common-voice · GitHub).
Ask your community to re-check/re-fill their demographic info.

daniel.abzakh · January 30, 2022, 9:15pm

Can you elaborate on this? How did you correct it?

Regarding the remedies, our team is having similar problems! I relayed the message to Hillary (@heyhillary). I have been told that around September or earlier, a person can fill in the demographic data and contribute without having to log in (as a guest).

bozden · January 30, 2022, 9:26pm

I did not correct anything. I only calculated how many records can be corrected, by analyzing the data.

It should be corrected globally in datasets as indicated here:

github.com/common-voice/CorporaCreator

Demographic data is sometimes doubled per client ID

