Speaker IDs for Speaker Recognition

Abhishek_Dandona · November 30, 2017, 9:55am

Hi Admin,

Thanks for releasing the dataset.
I notice that no speaker IDs are provided, only age, gender and accent, so this dataset can’t be used for speaker recognition projects. If you could also provide speaker IDs, which I believe can be derived easily from usernames, it would make the dataset more usable.

mhenretty · December 4, 2017, 4:04pm

Yup, we decided not to add speaker identifiers at this point (for privacy reasons). But, you can check out the Tatoeba dataset from our download page, which does group utterances by speaker.

http://voice.mozilla.org/data

Abhishek_Dandona · December 4, 2017, 4:14pm

Would it not be possible to anonymize the tags? That way there wouldn’t be any privacy issues.

mhenretty · December 4, 2017, 4:28pm

They would be anonymized, but there would still be privacy issues. It’s possible we may tackle this in the future, but right now we aren’t looking into that piece. Tatoeba is your best bet for now.

naktinis · December 5, 2017, 2:33pm

I wanted to add that the lack of speaker identities also makes the dataset unsuitable for language identification models (once other languages are added, of course). This is because you want to make sure that no speakers in your training set are in your test/validation sets; otherwise you will almost certainly overfit on speaker identities.
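
A minimal sketch of such a speaker-disjoint split, assuming each sample carries some speaker identifier (which the dataset did not provide at the time of this post). The function name `speaker_disjoint_split` and the sample data are hypothetical and only illustrate the idea:

```python
import random

def speaker_disjoint_split(samples, test_fraction=0.2, seed=0):
    """Split (speaker_id, clip) pairs so that no speaker is in both sets."""
    # Split over speakers, not over clips, so a voice never leaks across sets.
    speakers = sorted({speaker_id for speaker_id, _ in samples})
    random.Random(seed).shuffle(speakers)

    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])

    train = [(s, c) for s, c in samples if s not in test_speakers]
    test = [(s, c) for s, c in samples if s in test_speakers]
    return train, test

# Hypothetical data: four speakers, six clips.
samples = [
    ("spk_a", "clip1.mp3"), ("spk_a", "clip2.mp3"),
    ("spk_b", "clip3.mp3"), ("spk_c", "clip4.mp3"),
    ("spk_c", "clip5.mp3"), ("spk_d", "clip6.mp3"),
]
train, test = speaker_disjoint_split(samples, test_fraction=0.25)
```

Note that the fraction applies to speakers, so the share of clips in each set can differ from it when a few speakers contribute many clips.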

Maybe speakers could opt-in to have their samples marked with an anonymized id.

pwdonh · September 11, 2018, 6:43pm

I have a question related to the last post in this topic:

The corpus, as distributed, is split into train/dev/test sets. I understand you’re not sharing speaker IDs at this point. Are the splits done in such a way that a given speaker is not part of training and test set at the same time? This would be very useful information to interpret the generalization performance of an algorithm.

Is a given example sentence only spoken once by every speaker?

Thank you for your answers!

and.triantafy · October 17, 2018, 1:37pm

I came here looking for the answer to this question as well. Has there been any more progress on that?

Listening to some of the files I think I can recognize some speakers in both development and validation set (hint: group by age and look at age groups with few examples).

It would be great if someone from Mozilla could confirm that some (or all?) speakers (may?) appear in all sets.

stefan.falk · February 4, 2019, 12:27pm

Same here.

Not being able to group samples by speaker is very unfortunate. Are there any updates on this topic? We are interested in knowing the distribution of contributions per speaker, since it’s sometimes the case that very few speakers contribute a very substantial part of the spoken data.

Shouldn’t it be enough to generate a random number (key), hash the user IDs with that key, store those hashes as an additional field, and then throw away the key?
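
The idea above could be sketched roughly as follows. This is only an illustration of a keyed hash with a discarded key, not the scheme Mozilla uses; the function name `pseudonymize` and the choice of HMAC-SHA256 are assumptions:

```python
import hashlib
import hmac
import secrets

def pseudonymize(user_ids, key=None):
    """Map each user ID to a keyed hash; the key is never returned or stored."""
    if key is None:
        # Random one-time key; it goes out of scope when the function returns.
        key = secrets.token_bytes(32)
    return {
        uid: hmac.new(key, uid.encode("utf-8"), hashlib.sha256).hexdigest()
        for uid in user_ids
    }

# Hypothetical usernames.
pseudonyms = pseudonymize(["alice", "bob"])
```

All clips of one user get the same hash, so they can still be grouped, while the hash cannot be recomputed from a username without the key. As noted earlier in the thread, stable pseudonyms that group a speaker’s clips may still raise privacy concerns.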

gregor · February 4, 2019, 6:31pm

Thanks for the questions!
We’re including hashed client_ids with this release. And also we’re making sure that users are unique per bucket. You can find the code that does that here:
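
Separately from the Mozilla code referred to above (which is not reproduced here), a reader could check for themselves that no speaker appears in more than one bucket. The sketch below assumes each bucket is a tab-separated file with a `client_id` column; the helper names and the in-memory sample data are hypothetical:

```python
import csv
import io
from itertools import combinations

def client_ids(tsv_file):
    """Collect the set of client_id values from an open TSV file."""
    return {row["client_id"] for row in csv.DictReader(tsv_file, delimiter="\t")}

def overlapping_speakers(buckets):
    """Return {(bucket_a, bucket_b): shared_ids} for every overlapping pair."""
    overlaps = {}
    for a, b in combinations(sorted(buckets), 2):
        shared = buckets[a] & buckets[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps

# Tiny in-memory stand-ins for real files; "def" is deliberately duplicated.
train_tsv = io.StringIO("client_id\tpath\nabc\t1.mp3\ndef\t2.mp3\n")
dev_tsv = io.StringIO("client_id\tpath\nghi\t3.mp3\n")
test_tsv = io.StringIO("client_id\tpath\ndef\t4.mp3\n")

buckets = {
    "train": client_ids(train_tsv),
    "dev": client_ids(dev_tsv),
    "test": client_ids(test_tsv),
}
overlaps = overlapping_speakers(buckets)
```

An empty result means the buckets are speaker-disjoint; with the sample data above, the shared ID `def` is reported for the test/train pair.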

Codigo_Logo_Programacao_e_Inteligencia_Artificial · February 4, 2019, 7:42pm

This will be great for TTS research.

h_caulfield · October 15, 2020, 1:09am

Thanks for this confirmation.

On a related note, can you confirm if clientIDs are unique across all languages?

Pravin_Pandu · October 28, 2022, 2:05am

I just wanted to ask whether you could add a column stating the speaker’s country of origin. It would be helpful for analysis, since pronunciation differs between countries and it’s difficult to get a good model without that information. Thank you.

bozden · October 29, 2022, 8:04am

Welcome to Common Voice @Pravin_Pandu. Common Voice decided to use user-specified accent information for this purpose. You can read about it here:

On the other hand, because it is a free-text field, it does not provide reliable information.
