200 Ways to hear 'Him' and only 10 ways to hear 'Her'?

A_Wyman · July 23, 2017, 5:25am

The text to be read and listened to on the site falls short of providing an equal representation of female pronouns. It doesn’t make sense to teach the AI two hundred ways to hear ‘him’ and only 10 ways to hear ‘her’. In a small sample, I counted about 20 him/his/he to 9 ‘its’ to 1 ‘she’!! The supplied texts skew the same way: War of the Worlds largely favors ‘it’ and ‘him’; The Alchemist favors ‘he’, ‘him’, ‘his’, ‘men’, etc. The ‘funny’ text is also problematic: marginally offensive, dated, and insensitive at times. Ideally, using texts from more inclusive and diverse authors should solve this problem.
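A tally like the one described in this post can be reproduced with a few lines of Python. This is only a rough sketch: the grouping of pronouns and the `count_pronouns` helper are illustrative assumptions, not part of the Common Voice tooling, and the sample sentences are made up.

```python
import re
from collections import Counter

# Hypothetical grouping chosen for illustration; extend as needed.
PRONOUN_GROUPS = {
    "masculine": {"he", "him", "his", "himself"},
    "feminine": {"she", "her", "hers", "herself"},
    "neuter": {"it", "its", "itself"},
}

def count_pronouns(sentences):
    """Count pronoun occurrences per group across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase and split into alphabetic tokens, dropping punctuation.
        for token in re.findall(r"[a-z]+", sentence.lower()):
            for group, words in PRONOUN_GROUPS.items():
                if token in words:
                    counts[group] += 1
    return counts

# Made-up sample sentences, not taken from the Common Voice corpus.
sample = ["He saw her.", "It was his.", "She left."]
print(count_pronouns(sample))
```

Running this over the full sentence list would give a more reliable picture than a hand count of a small sample.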

mhenretty · July 23, 2017, 12:15pm

Thank you for raising this important topic, and we appreciate your insight.

Totally agreed that the text needs more diversity and representation. It has actually been quite hard for us to get good sentences because we want to release the voice data into the public domain, and so we need public domain source sentences. Since much of the conversational text in the public domain comes from old movies and books, the language can be quite antiquated and biased. We are currently investigating using the Stanford Natural Language Inference (SNLI) Corpus, but there are some licensing issues we need to work through. See this GitHub thread for more details.

In the meantime, we would love help expanding our sentence collection! If anyone has public domain source suggestions, or if you have text you personally wrote and want to put in the public domain (as part of Common Voice), this would be very helpful to the project.

A_Wyman · July 24, 2017, 2:30am

Great. Cool to read about the SNLI. I will reach out to my students to see what they might suggest, as we all have our own cultural blind spots. I will also take a look through public domain books and see what might be suitable or might counterbalance what you all have so far. Are you looking for conversational text specifically, or is short descriptive text okay too? It might be nice to run it through some automated linguistic analysis for bias, if there is such a thing.

mhenretty · July 24, 2017, 11:59am

Mostly we are looking for conversational text, but short descriptive text would work too. At this point we are open to exploring any source.

Also, I totally agree about using linguistic tools.

Pacman · August 2, 2017, 9:55pm

Existing text could be changed by substituting the opposite pronouns to even out the representation. That solves the problem of needing to use public domain sources. A simple regular expression search and replace should do the trick.
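A minimal sketch of what such a search and replace could look like in Python is shown here. The `SWAP` table and `swap_pronouns` function are hypothetical names introduced for illustration. Note that a plain regular expression cannot tell the two uses of "her" apart ("saw her" vs. "her book"), nor the two uses of "his" ("his book" vs. "the book is his"), so the mapping below makes a simplifying assumption and will produce some ungrammatical sentences that need manual review.

```python
import re

# Hypothetical swap table. "his" and "her" are ambiguous in English;
# the choices below are assumptions, not a complete solution.
SWAP = {
    "he": "she",
    "she": "he",
    "him": "her",
    "himself": "herself",
    "herself": "himself",
    "his": "her",   # assumes determiner use ("his book")
    "her": "him",   # assumes object use ("saw her"); wrong for "her book"
    "hers": "his",
}

# Word boundaries keep the pattern from matching inside words like "there".
PATTERN = re.compile(r"\b(" + "|".join(SWAP) + r")\b", re.IGNORECASE)

def swap_pronouns(sentence):
    """Replace each gendered pronoun with its counterpart, keeping capitalization."""
    def replace(match):
        word = match.group(0)
        swapped = SWAP[word.lower()]
        # Preserve a leading capital, e.g. at the start of a sentence.
        return swapped.capitalize() if word[0].isupper() else swapped
    return PATTERN.sub(replace, sentence)

print(swap_pronouns("He gave him the map."))  # She gave her the map.
print(swap_pronouns("She read her book."))    # He read him book. (shows the ambiguity)
```

The second example shows why a purely mechanical swap is not enough: handling "her" and "his" correctly would need part-of-speech information or a human pass over the output.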
