Age meta data

Akeith · September 17, 2022, 5:07pm

Hi

We’re interested in using the CV dataset to research the feasibility of building a solution that can estimate someone’s age using their voice.

We downloaded a small sample of the CV dataset a few months ago, and the exact age of each speaker was included in the metadata. However, it seems that the metadata now provides only an age range (<19, 19-29, 30-39, etc.).

How can I get access to the full 73GB CV dataset that includes exact ages for each speaker? Having the exact age will significantly aid our research into the feasibility of this solution.

bozden · September 17, 2022, 7:50pm

Hi @Akeith, welcome…

AFAIK, that information does not/cannot exist.

Because of privacy-related laws/rules and the ethics around them, providing that information has been voluntary in Common Voice. Only a portion of users register (you can record without registering) and provide demographic information. And even if they do, the age field is also voluntary (i.e. it can be left blank) and it is based on age ranges.

But I’m very intrigued by your project. I’m working on a human moderation tool and left age out of moderation.

Even humans cannot tell an exact age; we can only distinguish between child/young/elderly… Even gender can be difficult sometimes.
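For anyone who wants to check what the age field in a download actually contains, here is a minimal sketch that tallies it. It assumes the metadata TSV has an `age` column holding a range label (or an empty string when the contributor left it blank). The sample rows, the label spellings, and the `age_distribution` helper are made up for illustration, so please verify the column names against your own copy of the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows standing in for a metadata TSV.
# With a real download you would instead pass something like
# open("validated.tsv", encoding="utf-8", newline="") (file name assumed).
SAMPLE_TSV = (
    "client_id\tpath\tsentence\tage\tgender\n"
    "a1\tclip_1.mp3\tHello there\ttwenties\tfemale\n"
    "b2\tclip_2.mp3\tGood morning\t\t\n"
    "c3\tclip_3.mp3\tSee you soon\tthirties\tmale\n"
    "a1\tclip_4.mp3\tThank you\ttwenties\tfemale\n"
)


def age_distribution(tsv_file):
    """Count clips per age label; blank values are grouped as '(not provided)'."""
    # QUOTE_NONE so quotation marks inside sentences are read literally.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        label = (row.get("age") or "").strip()
        counts[label or "(not provided)"] += 1
    return counts


counts = age_distribution(io.StringIO(SAMPLE_TSV))
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

Note that this counts clips, not speakers. To count speakers you would first deduplicate on the contributor identifier (assumed here to be `client_id`).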

kathyreid · September 19, 2022, 11:50pm

Could you please link to where this sample was downloaded from?

Common Voice has never collected the exact ages of data contributors as far as I’m aware, and I’d be concerned for privacy reasons if it had, because exact ages could be used to re-identify people in the dataset.

Akeith · September 26, 2022, 5:27pm

Thanks both for your replies and sorry for my slow reply. I’ve been chasing our ML team and have just received the following update:

“…one of our Data Scientists has insisted he has seen it but is unable to point us to the source and we are unable to retrieve backups of it.”

As they’re “unable to find the source or retrieve backups” I suspect that there wasn’t in fact a link to a sample data set containing voice samples with the associated ages of the speaker, so apologies for causing concern!

That said, and re the privacy concerns you both mentioned, age is not linkable PII. It isn’t something that can be used (in isolation) to identify someone, so there shouldn’t be any issue exposing this data where available.

With the purpose of Common Voice being to make these large datasets available to all to help encourage innovation, it would be great to understand how we can get access to the dataset with the age unmasked (where age has been provided by the user).

Feel free to reach out privately by email if preferable.

bozden · September 26, 2022, 6:38pm

Thank you, good to know

I think it is doable in a controlled environment. A fork can be used to modify the software for a set of people who give consent. A bigger limitation comes from the fact that you need to be at least 20 (or have consent and constant supervision from a legal guardian). So there are almost no voices from children here (only a few), but one can devise a similar setup for a project about children’s speech-related problems, language learning, etc.

There was a related discussion lately, also pointing to security implications: Tags for voice (accent)

Akeith · September 26, 2022, 6:56pm

Re the related discussion, I don’t agree that there are any security risks associated with a person giving their age at the time of recording their voice. Age is not date of birth. Equally, a user could give their year of birth instead of their age. Again, this is not date of birth and therefore wouldn’t equate to PII. Additionally, banks typically use voice recognition software either at the beginning of calls when you’re speaking to an operator, or before the call starts, where you may be asked to say a specific phrase, so you couldn’t just play back a random voice clip that you’ve obtained from the CV dataset!

Re accessing the data with the age exposed in a controlled environment, how should we do this?

Re voices of children, understood. That is going to be difficult, but at this stage we are simply trying to confirm the feasibility of the project, which we can do without necessarily having children’s voices.

Akeith · October 3, 2022, 9:33am

@bozden @kathyreid any ideas on how we can move the above forward, i.e. accessing the data in a controlled environment?

bozden · October 3, 2022, 10:15am

@Akeith, I think there is a misunderstanding. There is no such data to access.

What I was saying was that the software is open source: you can modify it, respecting the Mozilla license, and collect your own data for your experiment in a controlled environment.

Btw, we are volunteers here, trying to help out. We have no other rights or access to the data.

Akeith · October 3, 2022, 11:32am

Ahh OK, yes misunderstanding.

I’ll speak to our ML team and discuss next steps with them.

Appreciate you taking the time to respond

kathyreid · October 4, 2022, 1:41am

@Akeith if what you’re looking for is more granular data from particular ages, then you may be better off with a paid data provider such as Appen, Data Ocean or Telus. These providers have “off the shelf” offerings of voice data, or, depending on your budget, you can have custom datasets collected.