Hello all! I wanted to drop in to chat with you about some upcoming changes to how we allow voice contributors to optionally describe their sex or gender. These changes will alter the metadata associated with the dataset.
To summarize the changes, we’ll be renaming this optional field to “Sex or Gender” to more accurately represent the expanded range of options that contributors may choose from.
Our original options of Female, Other, Male have been expanded to include the options below:
Female/Feminine
Male/Masculine
Intersex
Transgender
Non-binary
Don’t wish to say
There’s a blog post with more detail available here, but I’m also so happy to answer any questions anyone may have about the upcoming changes!
I’m thrilled to see such changes; I’m all in for the concept of non-biological gender
I don’t know much on voice-AI vs non-binary gender research
I’m not trying to second-guess project/team decisions
So I have to ask some questions for clarification:
In v16.1, there are 26 languages (out of 120) where the “other” percentage among recordings is >0.5%. These values are generally around 1%, though some low-resourced languages reach 12-13% (dominated by a few voices). I have not yet calculated the total number of people or recordings (which are different values) identified as “other”. We also “know” that non-biological genders make up 1-3% of the population globally. So:
Q1. Since the total percentage is generally low, I think further dividing the bucket into smaller ones might not yield statistically meaningful data. How can such small samples be used in research?
Q2. You are resetting the current “other” to “no-info”, so we are losing information. In the long run this can be remedied, but it will take too long to collect enough relevant data (see Q1).
a) It is also well known that people come, record for some time, and leave; user retention in CV is very low. So the data you delete will not come back (in more detail), because people who recorded a couple of years ago most probably will not return. Will you be contacting all previous volunteers (e.g. by mass mail)?
b) The gender data is inserted into the database when a recording is made and reflects the profile settings at that time. So even if a person comes back, logs in, and sets a new value in the “gender” field, the old recordings from the same person will not reflect that change. Do you have any plan to remedy this? Automatically relabeling past records, for example, can also be problematic, as “gender” identity can change over a person’s life…
c) The new options do not directly map to the identities in the LGBTQIA+ communities (each year a new letter is added); you have somehow re-categorized them, so again they will not reflect everyone, and not everyone will be satisfied with them. IMO, for people who do not want to give overly detailed information for privacy reasons (see Q3), “other” was an alternative (intersex, for example, is also biological). Did you consider keeping the “other” value and telling people “if you want more detailed representation, we have new options”?
Q3. I consider non-biological gender to be more privileged/private information. Although it is optional, it might raise privacy problems when given. From an STT perspective, I’m not sure how this can be useful, since we try to generalize across everyone in a single model. But it is different for TTS: somebody could pick the recordings of such a person and generate “gay-sounding” animated characters (you already know my concerns about TTS usage for voice duplication). What do you think?
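To make the "people vs. recordings" distinction above concrete, here is a minimal sketch of how the share of “other” could be computed both ways for one language. The function name `other_share` and the toy rows are made up for illustration; the only assumption about the real data is that each recording carries a `client_id` and a `gender` value.

```python
def other_share(rows, label="other"):
    """Return (% of recordings, % of distinct speakers) carrying `label`.

    rows: iterable of (client_id, gender) pairs, one pair per recording.
    A speaker counts as `label` if at least one of their recordings has it.
    """
    rows = list(rows)
    if not rows:
        return 0.0, 0.0
    recordings_with_label = sum(1 for _, gender in rows if gender == label)
    all_speakers = {client_id for client_id, _ in rows}
    speakers_with_label = {cid for cid, gender in rows if gender == label}
    recording_pct = 100.0 * recordings_with_label / len(rows)
    speaker_pct = 100.0 * len(speakers_with_label) / len(all_speakers)
    return recording_pct, speaker_pct


# Toy, invented data: one prolific "other" speaker in a small dataset.
rows = (
    [("s1", "other")] * 12
    + [("s2", "male")] * 40
    + [("s3", "female")] * 30
    + [("s4", "female")] * 18
)
recording_pct, speaker_pct = other_share(rows)
print(recording_pct, speaker_pct)  # 12.0 25.0
```

In this toy case a single voice produces the whole 12% of recordings, which is the "dominated by a few voices" situation described for the low-resourced languages.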
Hey, Giuseppe here, a researcher on fair speech and language technologies. Earlier in 2024, we published a study to measure the difference in the speech recognition performance of modern AI tools for many languages. The bulk of the study was on Common Voice 16.1 data.
Some of our experiments focused on comparing Masculine data vs “Other”, which was available back then.
However, I’m noticing that in new versions of the dataset, for most of the languages we tested there (including high-resource ones), genders other than “M/F” are not represented anymore (e.g., see the [gender distribution] in CV20).
I know that training/validation/test splits change from version to version. However, the statistics on GitHub should refer to the whole dataset.
In this sense, have all the recordings from “Other” speakers in versions <=16.1 been deleted?
P.S. In the paper’s final part, we discuss how to build gender-aware splits (section “Improve sociodemographic representation.”). I would happily chat about it for the upcoming releases and help where needed!
Hey @g8a9, any gender data referencing “other” has been deleted, and the remaining values (male/female) have been renamed. People who previously chose “other” are expected to provide their gender information again, using the new options. That data is lost.
For your research, though, it is possible to recover the data by matching client_id demographics.
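A rough sketch of that matching idea, under two assumptions that should be checked against the actual releases: that `client_id` values stay the same between versions, and that both releases expose `client_id` and `gender` columns. The helper names and the sample rows (including the renamed label) are hypothetical; real files could be read into such dicts with `csv.DictReader(f, delimiter="\t")`.

```python
def build_gender_lookup(old_rows):
    """Map client_id -> set of non-empty gender values seen in the old release."""
    lookup = {}
    for row in old_rows:
        gender = row.get("gender", "")
        if gender:
            lookup.setdefault(row["client_id"], set()).add(gender)
    return lookup


def restore_gender(new_rows, lookup):
    """Fill empty gender fields from the old release.

    Speakers with more than one old value are left empty, since the label
    may have changed over time and cannot be assigned unambiguously.
    """
    restored = []
    for row in new_rows:
        row = dict(row)  # copy, so the input is not modified
        if not row.get("gender"):
            seen = lookup.get(row["client_id"], set())
            if len(seen) == 1:
                row["gender"] = next(iter(seen))
        restored.append(row)
    return restored


# Invented rows standing in for an older and a newer release.
old_rows = [
    {"client_id": "a", "gender": "other"},
    {"client_id": "b", "gender": "male"},
    {"client_id": "c", "gender": "female"},
    {"client_id": "c", "gender": "other"},
]
new_rows = [
    {"client_id": "a", "gender": ""},
    {"client_id": "b", "gender": "male_masculine"},
    {"client_id": "c", "gender": ""},
    {"client_id": "d", "gender": ""},
]
restored = restore_gender(new_rows, build_gender_lookup(old_rows))
```

Only speaker "a" gets a value back here: "b" already has one, "c" is ambiguous, and "d" does not appear in the old release.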
“build gender-aware splits”
I’ve been working on splitting algorithms for a while now (also criticizing the existing CorporaCreator algorithm in an internal “paper”), for all CV languages, and have proposed new ones (open source). The problem here is that the datasets are very diverse. Some have so few validated recordings that you cannot even create train/dev/test splits. Thus that approach cannot be scaled to all languages, but maybe only to large datasets. Even some large ones have too few female voices.
It also depends on what you are trying to accomplish. The very conservative default splits are good for comparison across versions over time, though.
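As a rough illustration of what a gender-aware, speaker-disjoint split could look like, and of why it breaks down on small datasets, here is a minimal sketch. It is not the CorporaCreator algorithm nor any of the proposed alternatives; the function name, the 10% fractions, and the "fewer than 3 speakers" rule are assumptions chosen for the example.

```python
from collections import defaultdict


def gender_aware_split(speaker_gender, dev_frac=0.1, test_frac=0.1):
    """Split speakers into train/dev/test so each gender appears in every split.

    speaker_gender: dict mapping client_id -> gender label.
    Returns (splits, too_small): splits maps split name -> set of client_ids
    (no speaker is shared between splits); too_small lists the gender labels
    with fewer than 3 speakers, which are placed in train only.
    """
    by_gender = defaultdict(list)
    for client_id, gender in speaker_gender.items():
        by_gender[gender].append(client_id)

    splits = {"train": set(), "dev": set(), "test": set()}
    too_small = []
    for gender, client_ids in by_gender.items():
        client_ids.sort()  # deterministic order; a seeded shuffle also works
        n = len(client_ids)
        if n < 3:
            # Not enough speakers to cover three disjoint splits.
            too_small.append(gender)
            splits["train"].update(client_ids)
            continue
        n_dev = max(1, round(n * dev_frac))
        n_test = max(1, round(n * test_frac))
        splits["dev"].update(client_ids[:n_dev])
        splits["test"].update(client_ids[n_dev:n_dev + n_test])
        splits["train"].update(client_ids[n_dev + n_test:])
    return splits, sorted(too_small)


# Invented speakers: 10 male, 5 female, 2 "other".
speakers = {}
speakers.update({f"m{i}": "male" for i in range(10)})
speakers.update({f"f{i}": "female" for i in range(5)})
speakers.update({"o0": "other", "o1": "other"})
splits, too_small = gender_aware_split(speakers)
```

With these toy numbers the two “other” speakers cannot be spread over three splits, so they are reported in `too_small`, which mirrors the scaling problem with small datasets described above.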