Hi,
We want to train a speaker recognition / speaker identification model using a metric learning approach, so that we can identify speakers in a large dataset. The Common Voice dataset provides speaker information through its “client_id” metadata field. There have already been two discussions on this board concerning speaker recognition and client IDs:
From these discussions we know that some speakers might appear under multiple client_ids, especially when they are not logged in, but we are willing to take that risk and see how far we can get with this limitation. Now to the actual question:
When downloading the dataset, we all have to agree “to not attempt to determine the identity of speakers in the Common Voice dataset”. Does this clause prohibit training speaker identification algorithms with Mozilla Common Voice in general? What limitations does this clause impose on projects like this?
To prevent confusion about the goals of our project, here is what we want to do:
- Train similarity measures on speaker embeddings using triplet networks, LDA or similar techniques.
- Determine the quality of the algorithm by running inference on the test set.
- In production: use the trained model to extract embeddings from new, unrelated audio files, build an inference database from them, and match other files against this database.
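
To make the steps above more concrete, here is a rough sketch in plain Python of the two pieces involved: a triplet loss on embeddings, and matching a query embedding against an inference database. Everything in it is an assumption for illustration only: the function names, the margin and threshold values, and the toy three-dimensional vectors are made up, and in practice the embeddings would come from a trained network rather than being written by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on cosine distance (1 - cosine similarity).

    The loss is zero once the anchor is closer to the positive (same
    speaker) than to the negative (different speaker) by at least `margin`.
    The margin value is an illustrative assumption.
    """
    d_pos = 1.0 - cosine_similarity(anchor, positive)
    d_neg = 1.0 - cosine_similarity(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def best_match(query, database, threshold=0.7):
    """Return (label, score) of the most similar database entry.

    `database` maps a label to one embedding. If no entry reaches the
    (hypothetical) similarity threshold, the label is None.
    """
    best_label, best_score = None, -1.0
    for label, embedding in database.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


# Toy inference database built from new, unrelated audio (hand-written here).
database = {
    "speaker_a": [1.0, 0.0, 0.0],
    "speaker_b": [0.0, 1.0, 0.0],
}
label, score = best_match([0.9, 0.1, 0.0], database)
print(label, round(score, 3))
```

Note that in this sketch the database holds only embeddings of the new production audio; Common Voice would only be used to fit the model that produces them.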
What we explicitly don’t want to do:
- Use Common Voice data or derived embeddings in production, except for the trained model weights, of course.
- Match Common Voice embeddings against real-world audio files to identify the speakers in Common Voice.
You might be interested to know that there are already publications doing exactly this with Mozilla Common Voice, for example this one from the University of Lille:
https://hal.inria.fr/tel-03539738/document