"Offensive" language


In this thread I want to discuss the meaning, and the pros and cons, of the following sentence contribution policy:
“be nice, don’t use offensive language. we aren’t collecting that kinda material.”

Is it really that you don’t want to collect this kind of material or is it rather that you don’t want to present offensive sentences to our readers and verifiers? I can perfectly understand the latter and I support this argument, especially in regard to the variety of cultural backgrounds and ages.

However, in reality people will want to use their speech recognition software in any way they want, especially if they run the software locally and trust that their words will not be saved in the cloud. This includes offensive and also pornographic language. In case you need an eye-opener: the little German search engine DeuSu (which advertises privacy) publishes an uncensored list of the 100 most used search keywords every year: https://deusu.de/blog/2016-11-22-wonach_deutschland_in_2016_wirklich_gesucht_hat.html
I was surprised myself. So people will feel censored if the software works well but always fails to identify words of a certain category.

This is why I have tried to include some such words in my contributions (as you have probably noticed), but put them in an innocent context. For example, although it is often used in a different sense, the actual meaning of “cock” is “rooster”. And indeed, this has already triggered some discussion:

I really like that this happens publicly, by the way. That’s less black-box testing for me.

I’d like to read some opinions on this topic, from the Mozilla guys, as well as from other contributors.

[Help Wanted] Write some nice, short sentences for people to read
(Michael Henretty) #2

Thank you for raising this topic @jf99.

I totally agree with you that we do want “offensive” words in our data set. But as you guessed, for now we are trying to make the user experience nice for the Common Voice visitors, which includes children. But this is indeed a problem we will need to tackle sooner rather than later.

One idea we have been considering is having an R-Rated section, where people can go to yell profanities into their microphone. Whatever way we decide to approach this, I think it will need to be “opt-in” rather than the default.

I’m interested to see if others have thoughts on what we could do to collect some not-so-nice words.


On a related note, recently I was listening to a recording that said “you are going to get fucked”; this was not the sentence that should have been spoken. I am not really sure whether, when I press “nope”, more people have to be subjected to that verbal abuse (I was not in the least bit offended, just thought it a little strange), but if they would be, I think it would make sense to be able to report it somehow.

And yes, I do agree it would make sense to have an R-rated section so that “offensive” language can be recognized.
One could even start with the “abusive” reports and just add the text retroactively (assuming there is not a flood of trolls abusing the system).

(Michael Henretty) #4

Note, we also have a bug to allow people to flag potentially offensive content: