Bad words list for your languages

Right now https://github.com/Common-Voice/common-voice-wiki-scraper includes a blacklist word generator, but it is not perfect and can miss bad words and similar content. Wikipedia usually doesn’t include them, but it can happen, so how can we avoid it?

There are different projects on GitHub, but I found only this one that supports languages other than English: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/

I am now contributing to improve the Italian one, and maybe we can improve the tool and aggregate these words into the blacklist.

I have checked out that GitHub project, and the Spanish list does not seem appropriate to me.

It includes some gross language (as expected) but also some common words like ‘asesinato’ (murder).

An overview of other lists (‘pt’ for example) shows similar trends. The English list is well compiled, however.

Perhaps we could collect some samples from WordPress plugins or profanity-prevention editors. Heuristic classifiers for web filtering could also have a useful list. The old DansGuardian has this pretty complete bad-words list: http://contentfilter.futuragts.com/phraselists/


I saw that the Italian list is not very complete, but I think it is something to start from, or maybe to contribute to as well.
The point is that bad words are sometimes verbs or other words that are dirty only depending on the context, and this can create issues for the purpose of this project.
I think it is enough to make a list of bad words and other profanity that we want to be sure is not included in the sentences to read in Common Voice.
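As a rough sketch of the idea (not the actual scraper code, which is a separate project; the list contents and names here are made up for illustration), filtering sentences against a blocklist of whole words could look like this:

```python
import re

# Hypothetical example blocklist; a real one would be loaded from a
# per-language file such as the lists linked above.
BLOCKLIST = {"badword1", "badword2"}

def contains_blocked_word(sentence, blocklist=BLOCKLIST):
    """Return True if any whole word in the sentence is on the blocklist."""
    words = re.findall(r"\w+", sentence.lower())
    return any(word in blocklist for word in words)

sentences = ["This is fine.", "This contains badword1."]
clean = [s for s in sentences if not contains_blocked_word(s)]
print(clean)  # ['This is fine.']
```

Matching whole words (rather than substrings) avoids the classic problem of innocent words that merely contain a bad word, though it still cannot catch words that are offensive only in context.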

Hi Daniele,

What’s the reason to blacklist bad words?

If Common Voice’s goal is to create a sentence dataset of real spoken language, then bad words are (unfortunately) a big part of real spoken languages and they have to be part of the dataset. Otherwise we are “biasing” the dataset, and that’s usually considered a huge defect in any ML training set (e.g. for DeepSpeech in our case).

BTW, is there any Common Voice documentation I can read about word/topic blacklist filtering decisions/regulations? Thanks.

Conversely, I understand the need not to propose offensive sentences to anyone who contributes via the Common Voice website. Maybe users could opt in.

BTW, for Italian, I have used this bad-word list in the past: https://github.com/napolux/paroleitaliane/blob/master/paroleitaliane/lista_badwords.txt

giorgio

The point is that Common Voice needs it because it doesn’t have any parental advisory.
The project should be suitable for everyone, and for D&I we need to be sure about the content. As an example: what if we run an event about CV in a middle school with kids?


I see, but if you cut here and there, for any apparently good ethical reason, the language model will be biased. That’s bad.

Maybe the parental-control issue could be mitigated by an opt-in flag in the Common Voice website UX. For example, any adult content could be disabled by default, but website users should be allowed (details to be defined) to manage any kind of language content.

I renew my question: is there any Common Voice regulation/decision document on content filtering?

Thanks

Right now in CV there is nothing about it, so this discussion was opened to define this point; in the past this problem was raised a few times with no official statement about it.

So until CV does something official, we need to adapt and find a way :slight_smile:


Hi,

So, yes, we don’t want offensive language to be part of the Common Voice site experience.

At the same time, @Mte90, the CV site doesn’t allow people under 19 to contribute, for legal reasons.

Having said that, I think it would be good to get ideas on how to solve this problem with the previous two requirements in place. We also need to understand how critical capturing offensive wording is, and weigh that against the complications it can introduce.

Cheers.

Well, if we exclude people under 19, if we exclude bad words, if we exclude every possibly forbidden/bad content, we’ll produce a biased language model; that’s not the best way to achieve a high-quality, “democratic”, inclusive (and real spoken language) dataset, I fear.

I do not have quick solutions for how to balance/distribute contributions. My initial rule of thumb, at least for languages with scarce data, would be a “take all” strategy.

One of the things we are doing in the coming months is to benchmark our dataset against others on the market.

I wouldn’t assume we are producing better or worse quality without data in front of me to compare.

Do you know of other datasets that include underage and offensive language? Do we know of any study about how this influences the word error rate?


I’ll read the results of such a benchmark with pleasure, when available!

Do you know of other datasets that include underage and offensive language? Do we know of any study about how this influences the word error rate?

No, I don’t have scientific results on the topic (it is not my field of expertise, sorry). My feeling (to be proven) is that excluding common spoken-language abuse, profanity, and above all diversity of speakers (under 19) will create a low-quality/biased language model for DeepSpeech. Open point.

I think that right now this is not a problem; looking at how the project is growing, there are different goals, as in an agile context.

A reminder for anyone who wants to join the discussion: without promotion and awareness of the project, what we are doing is useless. For example, I have now been investing a lot of time in Italian with the community for a year, and I would prefer to see more people use the project and understand why they should contribute.
The Italian case is emblematic: we have a lot of contributors, 799 with 47 hours, compared to German with ~7300 and 396 hours, or French with ~7000 and 267 hours (numbers from https://voice.mozilla.org/it/languages).
We have more people relative to the recorded hours of those languages, but we are not able to keep contributors involved in the project, and I think we need that to move the project forward, including toward other discussions.

A bad-words list is one way to attract more people, it’s true, but right now we are not able to keep the people we already have contributing, and that is the big problem, at least from my point of view.

So, getting this thread back to the discussion: maybe we can add support in the scraper for a second list of sentences with profanities etc., and based on the age of the user, CV would decide whether to show them.
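To illustrate the two-list idea (again a hypothetical sketch, not the real scraper; the blocklist and file names are made up), the scraper could emit a clean set and a flagged set instead of dropping flagged sentences entirely:

```python
import re

# Hypothetical per-language blocklist; illustrative only.
BLOCKLIST = {"badword1", "badword2"}

def split_sentences(sentences, blocklist=BLOCKLIST):
    """Split sentences into (clean, flagged) lists by whole-word matching."""
    clean, flagged = [], []
    for sentence in sentences:
        words = set(re.findall(r"\w+", sentence.lower()))
        if words & blocklist:
            flagged.append(sentence)
        else:
            clean.append(sentence)
    return clean, flagged

clean, flagged = split_sentences(["All good here.", "Oh, badword2!"])
# clean  -> ["All good here."]
# flagged -> ["Oh, badword2!"]
```

The flagged set is never thrown away, so the dataset stays unbiased; only the site’s presentation layer decides which set a given user sees.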

Hi Daniele,
Thanks for the info.

It’s totally clear that, first of all, for languages with scarce data, like Italian, we need more recordings from people to improve the language models and bring down the error rate. Yes, we need a lot of dissemination to get as many people as possible to use the Common Voice portal.
On my side (I’m new here), I’ll publish an article about CV on my blog in November: https://www.convcomp.it, and I am willing to talk at Italian conferences/meetups/etc. about the topic (Mozilla CV, Mozilla DeepSpeech, Mozilla TTS).

Back to bad words management:

What do you mean by “scraper support”?

Anyway, I imagine two phases:

  1. The collection of sentences containing “bad words”
    These sentences would be flagged and kept in a set distinct from the “without bad words” set (you say “two lists”).

  2. The CV web portal user experience
    My proposal here is a change request on the UX:
    By default, users are not allowed to read sentences with bad words, ok!
    But the user (maybe a registered, profiled user) is able to opt in, bypassing the bad-word filter (saying/clicking somewhere in their profile: “I take all / I accept bad words”).

BTW, for under-19 users, I imagine a similar opt-in flow (I have to think more about it).
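The opt-in flow in point 2 could be sketched like this (purely hypothetical names; this is not actual CV backend code, just the shape of the logic):

```python
# Hypothetical sketch: the backend picks a sentence pool based on a
# per-user opt-in flag that is off by default.
from dataclasses import dataclass
import random

@dataclass
class UserProfile:
    accepts_profanity: bool = False  # opt-in only, never on by default

CLEAN_POOL = ["A perfectly polite sentence."]
FLAGGED_POOL = ["A sentence containing profanity."]

def next_sentence(user: UserProfile) -> str:
    """Serve from both pools only when the user has explicitly opted in."""
    pool = CLEAN_POOL + FLAGGED_POOL if user.accepts_profanity else CLEAN_POOL
    return random.choice(pool)
```

The key design point is that the default (`accepts_profanity=False`) is safe, so anonymous visitors and events with minors only ever see the clean pool.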

giorgio

If you look at the first post of this thread, you’ll find the scraper I am talking about.

As for the implementation, it depends on the OKRs of CV; I hope this discussion opens up some points of debate about it.