Speaker IDs for Speaker Identification Model

Hi,

We want to train a speaker recognition / speaker identification model using a metric learning approach, so that we can identify speakers in a large dataset. The Common Voice dataset provides speaker information through its “client_id” metadata field. There have already been two discussions on this board concerning speaker recognition and client IDs.


From these discussions we know that some speakers might appear under multiple different client_ids, especially when they are not logged in, but we are willing to take that risk and see how far we can get with this limitation. Now to the actual question:

When downloading the dataset, we all have to agree “to not attempt to determine the identity of speakers in the Common Voice dataset”. Does this clause prohibit training speaker identification algorithms on Mozilla Common Voice in general? What limitations does this clause impose on projects like this?

To prevent confusion about the goals of our project, here is what we want to do:

  • Train similarity measures on speaker embeddings using triplet networks, LDA or similar techniques.
  • Determine the quality of the algorithm by running inference on the test set.
  • In production use: use the trained model to extract embeddings of new and unrelated audio files to create an inference database and match other files against this database.
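
To make the steps above more concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the embeddings are tiny hand-written vectors rather than real model outputs, no Common Voice data is involved, and the function names (triplet_loss, best_match) and file labels are hypothetical, not taken from any library.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for one (anchor, positive, negative) triple.

    The loss is zero once the anchor is closer to the positive (same speaker)
    than to the negative (different speaker) by at least `margin`.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + margin, 0.0)

def best_match(query, database):
    """Return (label, similarity) of the database entry most similar to query."""
    scores = {label: cosine_similarity(query, emb) for label, emb in database.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]

# Hypothetical inference database built from new, unrelated audio files.
database = {
    "file_a": [1.0, 0.0, 0.0],
    "file_b": [0.0, 1.0, 0.0],
}
query = [0.1, 0.9, 0.0]  # embedding of a file to be matched
label, score = best_match(query, database)
print(label, round(score, 3))
```

In a real setup the loss would be minimised over many triples by a neural network framework, and the database lookup would typically use an approximate nearest-neighbour index instead of a linear scan.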

What we explicitly don’t want to do:

  • Use Common Voice data or embeddings derived from it in production, except for the trained model weights, of course.
  • Match Common Voice embeddings against real-world audio files to identify the speakers in Common Voice.

You might be interested to know that there are already publications doing this exact thing with Mozilla Common Voice, such as this one from the University of Lille:
https://hal.inria.fr/tel-03539738/document

Do you have a human ethics or institutional review board approval for this project? If so, can you share it?


Thanks so much for getting in touch! As you have noted, the terms of use for the Mozilla Common Voice datasets include the following: “You agree to not attempt to determine the identity of speakers in the Common Voice dataset”.

While I’ll need to talk to the legal team for comprehensive information on the details of these limits, we strongly encourage dataset consumers and community members to join us in taking as broad and robust an approach to contributor privacy as possible. That approach would bar the use of this data in processes that require identifying individual contributors by matching and grouping their separately contributed clips.

There may be other, non-Mozilla voice datasets with different terms of use that might be a better fit for your project.
