Mozilla Common Voice Speakers ID

Eran_Gilead · September 3, 2020, 5:34am

You have probably had this question before, but it’s hard for me to find a straight answer, so maybe you could help.
I’m looking to train a speaker recognition model, and from reviewing the English and French data I found that some of the different client_id values sound so alike that I think they belong to the same speaker.
Can I trust the different IDs, or is there a chance of having several IDs for one speaker? Do you have any suggestions on how I can get the best speaker separation in my data before I start training?

Adrijaned · September 3, 2020, 8:25pm

Unless the speaker was logged in at the time of the recording, they may easily have a different ID for each recording. As for determining that and relinking the data: you can’t. Unless you do it in some way from the sound itself, there is no data anywhere that would help you with that. Personally, I would also consider such an effort a legal gray area, as you agreed not to try to discern the identities of the individual speakers when you downloaded the dataset.

cjbaker · September 3, 2020, 10:43pm

Is there a way to figure out which utterances were made by logged-in users, rather than by guests? Inconsistent speaker IDs are very problematic for training and validating ASR models. This problem would also make the dataset useless for training models for speaker identification, diarization, etc. If I could figure out which data was recorded by guests, I would just discard it before training my model to avoid cross-contaminating the training and validation sets.

I had previously assumed that the train/valid splits were usable, but now I see that after logging out of mozilla.org I can still make recordings. I think that only logged-in users should be able to record in order to minimize this problem.
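
For reference, the idea of keeping speakers from leaking across the training and validation sets can be sketched roughly as below. This is only a minimal sketch, assuming a Common Voice style TSV with client_id and path columns; the function names, the 10% validation fraction, and the hashing scheme are illustrative assumptions, not part of the dataset tooling. It only guarantees that each client_id lands in exactly one split, so it cannot help if one person recorded under several IDs.

```python
import csv
import hashlib


def assign_split(client_id, valid_fraction=0.1):
    """Deterministically map a client_id to 'train' or 'valid' (hypothetical helper)."""
    digest = hashlib.sha256(client_id.encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a stable number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "valid" if bucket < valid_fraction else "train"


def split_by_client_id(tsv_file, valid_fraction=0.1):
    """Group clip paths so that every client_id appears in only one split."""
    # QUOTE_NONE: treat quote characters inside sentences as plain text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    splits = {"train": [], "valid": []}
    for row in reader:
        split = assign_split(row["client_id"], valid_fraction)
        splits[split].append(row["path"])
    return splits
```

It could be used with something like open("validated.tsv", newline="", encoding="utf-8") passed to split_by_client_id. Because the assignment depends only on the ID, re-running it on a newer release keeps existing IDs in the same split.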

timopheym · November 10, 2020, 1:02am

I agree; it would make sense to at least add a boolean field is_anonymous.