Speaker IDs for Speaker Recognition


(Abhishek Dandona) #1

Hi Admin,

Thanks for releasing the dataset.
I notice that no speaker IDs are provided, only age, gender and accent, so this dataset can’t be used for speaker recognition projects. If you could also provide speaker IDs, which I believe can be gathered easily from usernames, it would increase the usability of this dataset.


(Michael Henretty) #2

Yup, we decided not to add speaker identifiers at this point (for privacy reasons). But, you can check out the Tatoeba dataset from our download page, which does group utterances by speaker.


(Abhishek Dandona) #3

Would it not be possible to anonymize the tags? This way there won’t be any privacy issues.


(Michael Henretty) #4

They would be anonymized, but there would still be privacy issues. It’s possible we may tackle this in the future, but right now we aren’t looking into that piece. Tatoeba is your best bet for now.


(Rimvydas Naktinis) #5

I also wanted to add that the lack of speaker identities makes the dataset unsuitable for language identification models (that is, of course, once you add other languages). This is because you want to make sure that no speaker in your training set also appears in your test/validation sets (otherwise you will almost certainly overfit on speaker identities).

Maybe speakers could opt-in to have their samples marked with an anonymized id.
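
To illustrate the constraint above (no speaker shared between the training and test sets), here is a minimal sketch of a speaker-disjoint split. It assumes each clip carries an anonymized speaker ID, which the dataset does not currently provide; the function name `split_by_speaker`, the IDs and the file names are all made up for the example.

```python
import random


def split_by_speaker(utterances, test_fraction=0.2, seed=0):
    """Split (speaker_id, clip) pairs so that no speaker lands in both sets.

    The split is made over speakers, not over clips, so every clip from a
    given speaker ends up on the same side.
    """
    # Sort the unique speakers first so the shuffle is reproducible.
    speakers = sorted({speaker_id for speaker_id, _ in utterances})
    random.Random(seed).shuffle(speakers)

    # Hold out at least one speaker for the test set.
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])

    train = [(s, c) for s, c in utterances if s not in test_speakers]
    test = [(s, c) for s, c in utterances if s in test_speakers]
    return train, test


# Hypothetical toy data: anonymized speaker IDs paired with clip names.
utterances = [
    ("spk_a", "clip_001.mp3"),
    ("spk_a", "clip_002.mp3"),
    ("spk_b", "clip_003.mp3"),
    ("spk_c", "clip_004.mp3"),
    ("spk_c", "clip_005.mp3"),
    ("spk_d", "clip_006.mp3"),
]

train, test = split_by_speaker(utterances, test_fraction=0.25)
```

Because the fraction applies to speakers, the share of clips in each set will only roughly match `test_fraction` when speakers contribute different numbers of clips.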


(Peter Donhauser) #6

I have a question related to the last post in this topic:

The corpus, as distributed, is split into train/dev/test sets. I understand you’re not sharing speaker IDs at this point. Are the splits done in such a way that a given speaker never appears in both the training and test sets? This would be very useful information for interpreting the generalization performance of an algorithm.

Is a given example sentence only spoken once by every speaker?

Thank you for your answers!