Are there any stats available on the gender breakdown of the dataset? From the small subset I was able to analyse, the ratio of male to female voices in the English language dataset appears to be over 9:1.

If this is representative of the data as a whole, we run a pretty serious risk of accidentally biasing the algorithms trained on the dataset.
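For anyone who wants to check the ratio on their own copy of the data, here is a minimal sketch of how the tally could be done. It assumes the clip metadata is tab-separated with `client_id`, `path` and `gender` columns; those column names and the inline sample rows are assumptions made up for illustration, not figures from the real dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the assumed tab-separated layout; real metadata
# would be read from a file instead of this inline string.
SAMPLE_TSV = (
    "client_id\tpath\tgender\n"
    "a1\tclip1.mp3\tmale\n"
    "a1\tclip2.mp3\tmale\n"
    "b2\tclip3.mp3\tfemale\n"
    "c3\tclip4.mp3\t\n"
    "d4\tclip5.mp3\tmale\n"
)

def gender_breakdown(tsv_file):
    """Count clips per reported gender; blank values are grouped as 'unreported'."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter()
    for row in reader:
        label = (row.get("gender") or "").strip() or "unreported"
        counts[label] += 1
    return counts

counts = gender_breakdown(io.StringIO(SAMPLE_TSV))
total = sum(counts.values())
for label, n in counts.most_common():
    print(f"{label}: {n} clips ({n / total:.0%})")
```

Note that this counts clips, not speakers. A few very active contributors can skew a per-clip ratio, so counting unique `client_id` values per gender would be a useful second check.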

Is there anything we can do to substantially improve non-male representation within the data? Can we partner with female-centric organisations to recruit more non-male volunteers?

I also think there is a danger that the trained model will underperform on the voices of women, children and older people, as they are underrepresented.

The best we can do is to spread the word about Common Voice in the hope that more women will be willing to contribute.

I guess we could count on feminist networks, as they are very active on social media and could take that danger seriously. I mean, there is no doubt feminists would be concerned if you told them we are heading for a voice-controlled world where AIs can only understand male voices.

I think there might be a bias in who hears about Common Voice, mostly because word of it spreads through IT circles such as Mycroft or Mozilla itself.

The bias in contributors might come from those communities, which are mostly made up of men aged 20 to 40.

I would add that, more generally, there is an issue with spreading the word about Common Voice.
For example, there are 67 million people in France and only 1300 contributors to CV, with maybe 200 regular contributors at most.

I think that bias (males 20-40) could be reduced if a lot more people heard about CV.

My two cents on how we could get more people to know about Common Voice:

  • Share it on social media
  • Talk about it with people we know
  • Run events
  • Alert feminist groups in the hope that they spread the word

Sorry if these suggestions are obvious, but I can’t think of any others yet.

PS: I would like Mozilla to know that if you are looking for people to promote Common Voice at events in France, I would be happy to help.


Targeted web-based advertising would seem like a good way to attack this problem.

Currently, Mozilla has put no marketing effort behind Common Voice; all our traffic comes organically through media articles and blog posts. If we did put some money into driving traffic to the site, I agree we should focus on a female audience.

With this being the case, it is very hard for us to control the demographics of people who donate their voice, and indeed we don’t want to prevent anyone from donating. The way we are currently tackling this problem is to incentivize more people to report their sex (through our upcoming profile system). This way, at the very least data consumers can segment the data appropriately for their use case.
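As a rough illustration of what that segmentation could look like on the consumer side, here is a minimal sketch. It assumes each clip is a dict with a self-reported `gender` field (a hypothetical layout, since the profile system is not described here), and the helper names `segment_by_gender` and `balanced_subset` are made up for this example.

```python
import random
from collections import defaultdict

def segment_by_gender(clips):
    """Group clips by self-reported gender; missing values go to 'unreported'."""
    groups = defaultdict(list)
    for clip in clips:
        groups[clip.get("gender") or "unreported"].append(clip)
    return dict(groups)

def balanced_subset(groups, labels, seed=0):
    """Downsample every listed group to the size of the smallest one."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    size = min(len(groups.get(label, [])) for label in labels)
    subset = []
    for label in labels:
        subset.extend(rng.sample(groups.get(label, []), size))
    return subset

# Hypothetical clips mimicking a roughly 9:1 imbalance.
clips = [{"path": f"m{i}.mp3", "gender": "male"} for i in range(9)]
clips.append({"path": "f0.mp3", "gender": "female"})
clips.append({"path": "u0.mp3", "gender": ""})

groups = segment_by_gender(clips)
subset = balanced_subset(groups, ["male", "female"])
print({label: len(items) for label, items in groups.items()})
print([clip["path"] for clip in subset])
```

Downsampling like this throws away most of the majority group, so it is only one option. A data consumer might instead keep everything and weight the underrepresented clips more heavily during training.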

Michael, can you link me to some info on the upcoming profile system?

I’ve been wondering if it might be possible to create an in-person street/stall experience that Mozillians could use to recruit more volunteers. Something like a multiple-microphone setup that simultaneously records multiple samples with different types of equipment in a purposely challenging environment (e.g. outside with wind and traffic noise). If people read each sentence once while a variety of smartphones, headsets, etc. recorded it at the same time, you could increase the rate of sample collection and substantially increase the diversity of devices represented for each sample. You could link the audio samples together for validation, since the actual words spoken would be identical.
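To make the linking idea a bit more concrete, here is a minimal sketch of how simultaneous recordings could be tied together. Everything in it is hypothetical: the `Recording` fields, the shared `take_id`, and the rule that validating one take marks all of its recordings as validated are my assumptions, not anything Common Voice currently does.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Recording:
    take_id: str  # hypothetical id shared by every device that captured one reading
    device: str   # e.g. "phone-a", "headset"
    path: str     # audio file for this device

def group_by_take(recordings):
    """Collect all device recordings that captured the same spoken reading."""
    takes = defaultdict(list)
    for rec in recordings:
        takes[rec.take_id].append(rec)
    return dict(takes)

def propagate_validation(takes, validated_take_ids):
    """Map each file path to True if its take was validated, else False."""
    return {
        rec.path: take_id in validated_take_ids
        for take_id, recs in takes.items()
        for rec in recs
    }

recordings = [
    Recording("take-1", "phone-a", "take1_phone_a.wav"),
    Recording("take-1", "headset", "take1_headset.wav"),
    Recording("take-2", "phone-a", "take2_phone_a.wav"),
]
takes = group_by_take(recordings)
status = propagate_validation(takes, {"take-1"})
print(status)
```

One caveat: the words are identical across devices, but a recording made on a poor microphone in the wind might still be unintelligible, so each file would probably need at least a quick audio quality check of its own.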