Bad words list for your languages

Right now https://github.com/Common-Voice/common-voice-wiki-scraper includes a blacklist word generator, but it is not perfect and can miss bad words or other unwanted content. Wikipedia normally doesn't contain them, but it can happen, so how can we avoid it?

There are several projects on GitHub, but I found only one that supports languages other than English: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/

I am currently contributing to improve the Italian one, and maybe we can improve the tool and add these words to the blacklist.
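To make the aggregation idea concrete, here is a minimal sketch of merging several word lists into one deduplicated blacklist. It assumes each list is plain text with one entry per line; the function names `parse_word_list` and `merge_word_lists` are made up for illustration and are not part of the scraper.

```python
def parse_word_list(text):
    """Return the normalized, non-empty entries of a one-entry-per-line list."""
    # casefold() lowercases aggressively so "Foo" and "foo" count as one entry.
    return {line.strip().casefold() for line in text.splitlines() if line.strip()}


def merge_word_lists(*texts):
    """Merge any number of word lists into one sorted list without duplicates."""
    merged = set()
    for text in texts:
        merged |= parse_word_list(text)
    return sorted(merged)


if __name__ == "__main__":
    # Placeholder entries, not real list content.
    generated = "foo\nbar\n"
    external = "Bar\nbaz\n\n"
    print(merge_word_lists(generated, external))
```

The merged result could then be reviewed by a native speaker before it is used, since an external list may contain entries that are not appropriate for every language.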

I have checked out that GitHub project, and the Spanish list does not seem appropriate to me.

It includes some gross language (as expected) but also some common words like ‘asesinato’ (murder).

A quick look at other lists (‘pt’ for example) shows similar trends. The English list is well compiled, however.

Perhaps we could collect some samples from WordPress plugins or profanity-filtering editors. Heuristic classifiers for web filtering could also have a useful list. The old DansGuardian has this fairly complete bad words list: http://contentfilter.futuragts.com/phraselists/


I saw that the Italian list is not very complete, but I think it is something to start from, or maybe to contribute to.
The point is that some bad words are verbs or other words that are only dirty depending on the context, so this can create issues for the purpose of this project.
I think it is enough to make a list of bad words and other dirty words that we want to be sure are not included in the sentences to be read in Common Voice.
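As a rough sketch of how such a list could be applied, the snippet below rejects a sentence only when a listed entry appears as a whole word, and lets a small allowlist remove ordinary words (like ‘asesinato’) from a borrowed list. The names `build_matcher` and `is_clean` are hypothetical, the example words are placeholders, and this is not how the scraper currently works.

```python
import re


def build_matcher(blacklist, allowlist=()):
    """Compile a whole-word pattern for the blacklist minus the allowlist."""
    allowed = {w.casefold() for w in allowlist}
    entries = {w.casefold() for w in blacklist} - allowed
    if not entries:
        return None
    # Longest entries first, so multi-word phrases are tried before their parts.
    ordered = sorted(entries, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(e) for e in ordered) + r")\b"
    return re.compile(pattern)


def is_clean(sentence, matcher):
    """Return True when no blacklisted entry occurs as a whole word."""
    return matcher is None or matcher.search(sentence.casefold()) is None


if __name__ == "__main__":
    matcher = build_matcher(["badword", "asesinato"], allowlist=["asesinato"])
    for s in ["El asesinato fue investigado.", "This contains a BadWord."]:
        print(s, "->", "keep" if is_clean(s, matcher) else "drop")
```

Whole-word matching avoids dropping sentences just because a listed entry is a substring of a harmless word, but it cannot tell when an ordinary verb is used in a dirty sense, so context-dependent words would still need a manual decision.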