Bad words list for your languages

Mte90 · September 2, 2019, 11:15am

Right now https://github.com/Common-Voice/common-voice-wiki-scraper includes a blacklist word generator, but it may not be perfect and can miss bad words or other unwanted content. Wikipedia normally doesn’t include them, but it can happen, so how can we avoid it?

There are different projects on GitHub, but I found only this one that supports languages other than English: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/

I am contributing now to improve the Italian one, and maybe we can improve the tool and aggregate these words into the blacklist.
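A minimal sketch of what "aggregating" could look like, assuming each source (the scraper's generated blacklist, a community list) is simply a sequence of words. The function name `merge_word_lists` and the placeholder words are hypothetical and are not part of the scraper.

```python
def merge_word_lists(*word_lists):
    """Merge several word lists into one sorted, de-duplicated blacklist."""
    merged = set()
    for words in word_lists:
        for word in words:
            # Normalize so that "Bar" and "bar " count as the same entry.
            cleaned = word.strip().casefold()
            if cleaned:  # skip empty lines
                merged.add(cleaned)
    return sorted(merged)


# Placeholder words standing in for real list entries.
scraper_blacklist = ["foo", "Bar"]
community_list = ["bar ", "baz", ""]
print(merge_word_lists(scraper_blacklist, community_list))
```

Normalizing before merging keeps duplicates that differ only in case or whitespace out of the final list.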

alfem · September 3, 2019, 6:07am

I have checked out that GitHub project, and the Spanish list does not seem appropriate to me.

It includes some gross language (as expected) but also some common words like ‘asesinato’ (murder).

A look at other lists (‘pt’, for example) shows similar trends. The English list is well compiled, however.

Perhaps we could collect some samples from WordPress plugins or profanity-prevention editors. Heuristic classifiers for web filtering could also have a useful list. The old DansGuardian has this pretty complete bad-words list: http://contentfilter.futuragts.com/phraselists/

Mte90 · September 3, 2019, 1:13pm

I saw that the Italian list is not very complete, but I think it is something to start from, or maybe to contribute to.
The point is that sometimes bad words are verbs or other words that are dirty only depending on the context, so this can create issues for the purpose of this project.
I think it is enough to make a list of bad words and other dirty words that we want to be sure are not included in the sentences to be read in Common Voice.
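A minimal sketch of such a filter, assuming sentences and the word list are plain Python strings. The names `filter_sentences`, `contains_blocked_word` and the `allowed` override set are hypothetical, and "badword" is a placeholder. Whole-word matching like this cannot tell whether a context-dependent word is used in a dirty sense; the allow-list only lets a community re-admit common words such as ‘asesinato’ that a third-party list blocks.

```python
import re


def contains_blocked_word(sentence, blocked, allowed=frozenset()):
    """Return True if any whole word of the sentence is blocked.

    `blocked` and `allowed` are expected to be case-folded already.
    """
    tokens = re.findall(r"\w+", sentence.casefold())
    return any(t in blocked and t not in allowed for t in tokens)


def filter_sentences(sentences, blocked, allowed=frozenset()):
    """Keep only the sentences that contain no blocked word."""
    blocked = {w.casefold() for w in blocked}
    allowed = {w.casefold() for w in allowed}
    return [s for s in sentences
            if not contains_blocked_word(s, blocked, allowed)]


sentences = [
    "El asesinato fue en 1920.",
    "This has a BadWord inside.",
    "A clean sentence.",
]
kept = filter_sentences(sentences,
                        blocked={"badword", "asesinato"},
                        allowed={"asesinato"})
print(kept)
```

Matching on whole tokens rather than substrings avoids rejecting harmless words that merely contain a blocked string.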

solyarisoftware · October 21, 2019, 3:32pm

Hi Daniele,

What’s the reason to blacklist bad words?

If the goal of Common Voice is to create a sentence dataset of real spoken language, then bad words are (unfortunately) a big part of real spoken language and they have to be part of the dataset. Otherwise we are “biasing” the dataset, and that’s usually considered a huge defect of any ML training data set (e.g. for DeepSpeech in our case).

BTW, is there any Common Voice documentation I can read about word/topic blacklist filtering decisions/regulations? Thanks.

Conversely, I understand the need not to propose offensive sentences to anyone who contributes via the Common Voice website. Maybe users could opt in.

BTW, for Italian, I have used this bad-word list in the past: https://github.com/napolux/paroleitaliane/blob/master/paroleitaliane/lista_badwords.txt

giorgio

Mte90 · October 21, 2019, 3:46pm

The point is that Common Voice needs it because it doesn’t have any parental advisory.
So the project should be suitable for everyone, and for D&I we need to be sure about the content. For example, what if we run an event about CV in a middle school with kids?

solyarisoftware · October 21, 2019, 4:36pm

I see, but if you cut here and there, for any apparently good ethical reason, the language model will be biased. That’s bad.

Maybe the parental control issue could be mitigated by an opt-in flag in the Common Voice website UX. For example, any adult content could be disabled by default, but website users must be allowed (details to be defined) to manage any kind of language content.

I renew my question: is there any Common Voice regulation/decision document on content filtering?

Thanks

Mte90 · October 21, 2019, 4:39pm

Right now in CV there is nothing about it, so this discussion was opened to define this point. In the past this problem was raised a few times with no official statement about it.

So until CV does something about it, we need to adapt and find a way.

nukeador · October 21, 2019, 4:46pm

Hi,

So, yes we don’t want offensive language to be part of the Common Voice site experience.

At the same time, @Mte90, the CV site doesn’t allow people under 19 to contribute, for legal reasons.

Having said that, I think it would be good to get ideas on how to solve this problem with the previous two requirements in place. We also need to understand how critical capturing offensive wording is, and weigh that against the complications it can introduce.

Cheers.

solyarisoftware · October 21, 2019, 5:04pm

Well, if we exclude under-19s, if we exclude bad words, if we exclude any possible forbidden/bad content, we’ll produce a biased language model; that’s not the best way to achieve a high-quality, “democratic” and inclusive (and real spoken language) dataset, I fear.

I do not have quick solutions for how to “balance”/distribute contributions. As a rule of thumb, at first (for languages with scarce data) I would adopt a “take all” strategy.

nukeador · October 21, 2019, 5:09pm

One of the things we are doing in the coming months is to benchmark our dataset with others out in the market.

I wouldn’t assume we are producing better or worse quality without data in front of me to compare.

Do you know other datasets that are including underage and offensive language? Do we know of any study about how this influences the word error rate?

solyarisoftware · October 21, 2019, 5:25pm

I’ll read the results of such a benchmark with pleasure, when available!

“Do you know other datasets that are including underage and offensive language? Do we know of any study about how this influences the word error rate?”

No, I don’t have scientific results on the topic (it is not my field of expertise, sorry). My feeling (to be proved) is that excluding common spoken-language abuses, profanity, and above all diversity of speakers (under 19) will create a low-quality/biased language model for DeepSpeech. Open point.

Mte90 · October 22, 2019, 9:45am

I think that right now this is not a problem. Looking at how the project needs to grow, there are different goals, thinking in an agile context:

- Promote Common Voice better to get more recordings (see “October community campaign”), or whatever we do is useless anyway. The corpus in a lot of languages is still not at the goal defined 2 years ago when the project started (English, with 2000 hours, being the first example).
- https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error-rate/: the error rate is very low, and as a community we can focus on it once all or most of the languages have reached the recording-hours goal.
- Add support for accents (see “🗣 Feedback needed: Languages and accents strategy”), because we are excluding some areas and also a lot of information in the dataset (for example, the DS French model uses a dataset of African French accents).

A reminder for anyone who wants to join the discussion: without promotion and awareness of the project, what we are doing is useless. For example, for a year now I have invested a lot of time in Italian with the community etc., and I would prefer to see more people use the project and understand why they should contribute.
The Italian case is emblematic: we have a lot of contributors, 799 with 47 hours, compared to German with ~7300 and 396 hours or French with ~7000 and 267 hours (numbers from https://voice.mozilla.org/it/languages).
We have more people compared to the recording hours of those languages, but we are not able to keep contributors in the project, and I think we need to solve that before moving the project on to other discussions.

A bad-words list is a way to get more people, it’s true, but right now we are not able to keep the people we already have contributing; that is the big problem, at least from my point of view.

So, getting this thread back to the discussion: maybe the scraper could support a second list of sentences with profanities etc., and CV would show them based on the age of the user.

solyarisoftware · October 23, 2019, 8:17am

Hi Daniele,
Thanks for the info.

It’s totally clear that, first of all, for languages with scarce data, like Italian, we need more recordings from people to improve language models and lower the error rate. Yes, we need a lot of dissemination to get people to use the Common Voice portal as much as possible.
For my part (I’m new here), I’ll publish an article about CV on my blog in November: https://www.convcomp.it and I am willing to talk about the topic (Mozilla CV, Mozilla DeepSpeech, Mozilla TTS) at Italian conferences/meetups/etc.

Back to bad words management:

What do you mean by “scraper support”?

Anyway, I imagine two phases:

1. The collection of sentences containing “bad words”:
   these sentences would be flagged and kept in a set distinct from the “without bad words” set (you say “two lists”).
2. The CV web portal user experience. My proposal here is a change request on the UX:
   - By default users are not allowed to read sentences with bad words, OK!
   - But the user (maybe a registered, profiled user) is able to opt in, bypassing the bad-word filter (by saying/clicking somewhere in the profile: “I take all / I accept bad words”).
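A minimal sketch of the two phases, assuming sentences are plain strings and a bad-word list is already available. `User`, `accepts_bad_words`, `split_sentences` and `sentences_for` are hypothetical names for illustration, not part of the Common Voice code base, and "badword" is a placeholder.

```python
import re
from dataclasses import dataclass


@dataclass
class User:
    # Hypothetical profile flag: the opt-in is off by default.
    accepts_bad_words: bool = False


def split_sentences(sentences, bad_words):
    """Phase 1: keep flagged sentences in a set distinct from the clean one."""
    bad = {w.casefold() for w in bad_words}
    clean, flagged = [], []
    for sentence in sentences:
        tokens = set(re.findall(r"\w+", sentence.casefold()))
        (flagged if tokens & bad else clean).append(sentence)
    return clean, flagged


def sentences_for(user, clean, flagged):
    """Phase 2: show flagged sentences only to users who opted in."""
    return clean + flagged if user.accepts_bad_words else list(clean)


clean, flagged = split_sentences(
    ["A clean one.", "Some BadWord here."], ["badword"]
)
print(sentences_for(User(), clean, flagged))
print(sentences_for(User(accepts_bad_words=True), clean, flagged))
```

Keeping the flagged sentences instead of discarding them means the decision about who sees them can change later without collecting the sentences again.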

BTW, for under-19s, I imagine a similar opt-in flow (I have to think more about it).

giorgio

Mte90 · October 23, 2019, 8:25am

If you look at the first post of this thread, there is the scraper I am talking about.

As for the implementation, it depends on the OKRs of CV; I hope this discussion opens up some discussion points about it.
