Hello, I’m starting this thread to explain a problem I’ve run into and propose a possible solution.
I want to use the Common Voice dataset for a Text-to-Speech (TTS) project, and for that I need each voice to be labelled as either male or female. Common Voice does not provide this attribute for many of its recordings.
I was forced to rely on the ‘client_id’ attribute, which identifies each user account. I discovered that some people let their partners record sentences from the same account, so a single client_id can be associated with both male and female voices, contaminating the dataset.
To address this, I developed a small (4.6 MB) CNN model that classifies a recording’s voice as male or female. It’s accurate and very lightweight. I’ve converted it to ONNX and deployed it live on Hugging Face so everyone can use it and see how well it works.
Here is the link: Cnn Voice Classifier - a Hugging Face Space by ctatu
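Roughly, running the exported model locally looks like this. It’s only a sketch: the feature front-end (a log-mel spectrogram), the 16 kHz sample rate, the input shape, and the label order are illustrative assumptions here, so check the Space’s code for the real preprocessing:

```python
import librosa
import numpy as np
import onnxruntime as ort

def classify_voice(wav_path, model_path="voice_classifier.onnx"):
    # Load audio at a fixed sample rate (16 kHz is an assumption here).
    audio, sr = librosa.load(wav_path, sr=16000)
    # Assumed front-end: log-mel spectrogram, batched as (1, 1, n_mels, frames).
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=64)
    features = np.log(mel + 1e-6)[np.newaxis, np.newaxis, :, :].astype(np.float32)

    session = ort.InferenceSession(model_path)
    input_name = session.get_inputs()[0].name
    logits = session.run(None, {input_name: features})[0]
    # Assumed label order -- verify against the model card before relying on it.
    return ["female", "male"][int(np.argmax(logits))]

print(classify_voice("clip.mp3"))
```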
Therefore, my proposal is to add this model to the web recording application so that it runs locally in each user’s browser and checks the gender of the recorded voice. Naturally, the user would have control to override the result if the model makes a mistake.
In this way, we could automatically add the ‘gender’ attribute to recordings, which I believe would add significantly more value than the current situation, where there is often no way to obtain this information.
I don’t mind implementing it myself. I can open an issue on the common-voice GitHub repository.
Common Voice already allows the recorder to self-identify their gender, but I suspect few people will change that choice. In any case, I don’t think it’s right to label each voice’s gender and collect it into a database; it’s better to do that with your tool at the post-processing stage, according to each user’s own requirements.
Hi there @cTatu - firstly thanks for your interest in Common Voice, and for your feedback. The Common Voice speech data is primarily intended for speech recognition and has not been designed with the different requirements of speech synthesis in mind.
There are multiple challenges to using Common Voice for TTS. As you point out, a single client_id can sometimes contain multiple speakers, likely from recording sessions where one machine is used to record many people. We also have the opposite challenge, where one speaker records under multiple client_ids, because they’re not logged in or don’t create an account, and so are assigned a different client_id for each session.
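To get a sense of the scale of these issues, anyone with a dataset download can inspect the metadata themselves. A rough sketch, assuming the standard validated.tsv layout with client_id and gender columns:

```python
import pandas as pd

df = pd.read_csv("validated.tsv", sep="\t", usecols=["client_id", "gender"])

# How many clips each client_id contributed (multiple speakers may hide here).
clips_per_id = df.groupby("client_id").size().sort_values(ascending=False)
print(clips_per_id.describe())

# How many client_ids carry no self-reported gender at all.
missing = df.groupby("client_id")["gender"].apply(lambda g: g.isna().all())
print(f"{missing.mean():.1%} of client_ids have no self-reported gender")
```

Note that the opposite case, one speaker spread across many client_ids, is invisible in the metadata alone; it could only be probed by analysing the audio itself.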
You may be more comfortable using a dataset created specifically for TTS, such as LibriTTS or LJSpeech, and I would be happy to provide those details.
Moreover, none of the speakers in Common Voice has provided consent for their voices to be cloned via TTS, and you may wish to consider the ethical implications of your work.
On the topic of classifying gender: we provide a range of options for speech data contributors to disclose the demographic information they’re comfortable sharing. This ranges from none - for example, if they do not create a Profile and submit speech data while not logged in - to specifying their age range and gender expression via their Profile. Some people explicitly choose “prefer not to say” for their gender.
If speech data contributors have chosen not to make a Profile and supply demographic data, then we need to carefully consider whether it is ethical to apply a gender classifier after the fact - because we do not have consent from the data contributor to apply a retrospective classification to their voice.
We will not be applying a gender classifier to the data currently in Common Voice.
If you wish to perform this as a form of data pre-processing for your own purposes, there is nothing license-wise that prohibits it; however, I strongly urge you to consider the ethical aspect of retrospectively classifying demographic attributes such as gender without the consent of data contributors.
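As a sketch of what that local pre-processing could look like, and of how to handle the mixed-speaker client_ids raised earlier in the thread: this is purely illustrative, with classify_voice being the hypothetical helper from the earlier post (assumed here to live in a local voice_classifier.py), and the clips/ directory and TSV layout following the standard download structure.

```python
from pathlib import Path
import pandas as pd

# classify_voice is the illustrative helper sketched earlier in the thread;
# assume it lives in a local voice_classifier.py for this example.
from voice_classifier import classify_voice

df = pd.read_csv("validated.tsv", sep="\t", usecols=["client_id", "path"])
clips_dir = Path("clips")  # assumed location of the downloaded audio files

# Predict a gender for every clip (slow but simple; batch this in practice).
df["predicted_gender"] = [classify_voice(str(clips_dir / p)) for p in df["path"]]

# client_ids whose clips get mixed predictions are suspect for per-speaker TTS.
genders_per_id = df.groupby("client_id")["predicted_gender"].nunique()
suspect_ids = genders_per_id[genders_per_id > 1].index
print(f"Flagged {len(suspect_ids)} client_ids with mixed predictions")

# Keep only consistent, single-gender speakers for the TTS corpus.
clean = df[~df["client_id"].isin(suspect_ids)]
clean.to_csv("validated_with_predicted_gender.tsv", sep="\t", index=False)
```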