The text to be read and listened to on the site falls short of providing an equal representation of female pronouns. It doesn’t make sense to teach the AI two hundred ways to hear ‘him’ and only ten ways to hear ‘her’. In a small sample, I counted about 20 instances of him/his/he, 9 of ‘its’, and 1 of ‘she’! Of the texts supplied, War of the Worlds largely favors ‘it’ and ‘him’, and the Alchemist favors ‘he’, ‘him’, ‘his’, ‘men’, etc. The ‘funny’ text is also problematic: it is dated, insensitive at times, and marginally offensive. Ideally, using text from more inclusive and diverse authors would solve this problem.
Thank you for raising this important topic, and we appreciate your insight.
Totally agreed, the text needs more diversity and representation. It has actually been quite hard for us to get good sentences: we want to release the voice data into the public domain, so we need public domain source sentences. Since a lot of the conversational text in the public domain comes from old movies and books, the language can be quite antiquated and biased. We are currently investigating using the Stanford Natural Language Inference (SNLI) Corpus, but there are some licensing issues we need to work through. See this GitHub thread for more details.
In the meantime, we would love help expanding our sentence collection! If anyone has public domain source suggestions, or if you have text you personally wrote and want to put in the public domain (as part of Common Voice), this would be very helpful to the project.
Great. Cool to read about the SNLI. I will reach out to my students to see what they might suggest, as we all have our own cultural blind spots. I will also take a look through public domain books and see what might be suitable or might counterbalance what you all have so far. Are you looking for conversational text specifically, or is short descriptive text okay too? It might be nice to run it through some automated linguistic analysis for bias, if there is such a thing.
Mostly we are looking for conversational text, but short descriptive text would work too. At this point we are open to exploring any source.
Also, I totally agree about using linguistic tools.
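As a rough first pass at that kind of check, something like the following could tally gendered pronouns across a list of candidate sentences. This is only a sketch: the `GROUPS` word lists and the `count_pronouns` helper are made up for illustration and are not part of any Common Voice tooling, and counting pronouns catches only one narrow kind of imbalance.

```python
import re
from collections import Counter

# Hypothetical pronoun groups, chosen for illustration only.
GROUPS = {
    "masculine": {"he", "him", "his", "himself"},
    "feminine": {"she", "her", "hers", "herself"},
    "neuter": {"it", "its", "itself"},
}

def count_pronouns(sentences):
    """Return a Counter of pronoun-group occurrences across all sentences."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase and split into runs of letters, so "He's" yields "he".
        for token in re.findall(r"[a-z]+", sentence.lower()):
            for group, words in GROUPS.items():
                if token in words:
                    counts[group] += 1
    return counts

if __name__ == "__main__":
    sample = ["He saw it and gave her his hat.", "She left."]
    print(count_pronouns(sample))
```

Running this over each source text separately would show which sources skew the overall balance the most.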
Existing text could be changed by substituting the opposite pronouns to even out the representation. That solves the problem of needing to use public domain sources. A simple regular expression search and replace should do the trick.
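For what it's worth, here is a minimal sketch of what such a search and replace might look like in Python. The `SWAP` table and the `swap_pronouns` function are assumptions made up for this example. One caveat: ‘her’ can correspond to either ‘him’ or ‘his’, and ‘his’ to either ‘her’ or ‘hers’, so a plain regular expression will get some sentences wrong and the output would still need a human review.

```python
import re

# Hypothetical swap table. "his" and "her" are ambiguous in English,
# so the choices below are guesses that will be wrong in some sentences.
SWAP = {
    "he": "she",
    "she": "he",
    "him": "her",
    "himself": "herself",
    "herself": "himself",
    "hers": "his",
    "his": "her",  # wrong for standalone "his" ("the hat is his")
    "her": "him",  # wrong for possessive "her" ("her hat")
}

# Match whole words only, ignoring case; longer words are listed first.
PATTERN = re.compile(
    r"\b(" + "|".join(sorted(SWAP, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

def swap_pronouns(sentence):
    """Swap gendered pronouns, keeping the capitalization of each match."""
    def replace(match):
        word = match.group(0)
        swapped = SWAP[word.lower()]
        if len(word) > 1 and word.isupper():
            return swapped.upper()
        if word[0].isupper():
            return swapped.capitalize()
        return swapped
    return PATTERN.sub(replace, sentence)

if __name__ == "__main__":
    print(swap_pronouns("He gave him the book."))
    print(swap_pronouns("She saw her brother."))  # shows the "her" ambiguity
```

The second call comes out as "He saw him brother.", which is exactly the kind of error a pure regex approach cannot avoid without some part-of-speech information.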