CVSS: Is offensive lexicon strictly forbidden, or should it just not be addressed to anyone?

I know the guidelines, but it's unclear from them what “Offensive” means exactly. Does it cover all “offensive lexicon” in general, or does it mean that this lexicon should not be used in earnest?

I wanted to add the following “question” to the Russian CVSS: “Read this prompt aloud and only then answer it. Say 5-6 random offensive words.” In this case it would be clear to both recorders and transcribers that these offensive words are not addressed to anyone specific and are said only so that they can be collected.

I'm asking because I'm currently trying to write prompts that elicit a more diverse lexicon in answers, including professional jargon, slang, etc. The guidelines ask not to add “questions which are very culturally specific, or make a lot of assumptions about the responder” (such as a specific profession, I guess), so I use all my ingenuity to write general questions/prompts in a way that still elicits words used by people with specific knowledge or backgrounds and increases word diversity. For instance:

  • Open a random post from the social network you use. Now briefly describe this random post. What is it about, or who created it? (needs some preparation, but can be answered briefly and can potentially increase the diversity of words in answers)
  • It is not recommended here to add questions that are only understood by a narrow circle of people, as not everyone responding may be familiar with the topic of such a specific question. But let’s assume there are no such limitations, and you know for sure that your interlocutor is immersed in the context. What question would you ask them?

I wanted to do a similar thing with offensive words, as you can see, because I think it is important to collect them for training models that can find and report/ban this type of speech. I know this is a sensitive topic, so I wanted to check the current policy before adding anything. But I found only this topic: "Offensive" language, and it was created in 2017, so maybe something in it is outdated or not relevant, because CVSS didn't exist then.

Also, it still isn't clear what “offensive” means exactly. Is it a personal/group insult, or should answers like “Fuck! I forgot the right word!” be reported as well?

Yes, sensitive :slight_smile: - and totally subjective…
But to train a SPAM-protection AI you need SPAM emails, and to auto-insert “beeps” in speech tech you would need a separate dataset with those words TAGGED…

Would you like to have these datasets?

  • ru-offensive
  • ru-engineering
  • ru-medical

IMO, like dictionaries, these are also different datasets, which can be merged if required.
In my Turkish text-corpus work a couple of years ago, I wrote some sample sentences with everyday swear words (not offending anybody, just what we use in everyday speech); most got reported, but all were recorded. Unlike in Scripted Speech, in Spontaneous Speech anything that is reported is taken out of circulation, so it is lost.

Please wait a bit more - you may see an announcement in your new group one day :slight_smile:

You may also like to read my (unofficial) views on this topic:

I want to increase the diversity of language situations/words covered by the dataset, not just something specific.

Yeah, you need specific solutions for specific situations, but there are always some common solutions, and I thought of CV as one of them. Isn't it a goal of Common Voice to collect all types of spontaneous speech to make a dataset for ASR models that can convert all possible speech to text? If not, then what are your specific goals? Which cases are you trying to cover by collecting this dataset?

In the specific case I mentioned, I meant that once a model trained on CV is able to transcribe all possible words, you can easily add/train another model that analyzes the transcript based on your needs, such as finding offensive words, off-topic answers, etc.
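
To make the idea concrete, here is a minimal sketch of such a second step, assuming the ASR model has already produced a plain-text transcript. It is a simple word-list lookup rather than a trained model, and the names `FLAGGED_WORDS`, `find_flagged_words` and `needs_review` are hypothetical placeholders, not part of any Common Voice tooling:

```python
import re

# Hypothetical placeholder lexicon; a real one would be language-specific
# and maintained separately from the speech dataset.
FLAGGED_WORDS = {"badword", "swearword"}


def find_flagged_words(transcript, lexicon=FLAGGED_WORDS):
    """Return the flagged words found in a transcript, in order of appearance."""
    # \w+ is Unicode-aware in Python 3, so this also tokenizes Cyrillic text.
    tokens = re.findall(r"\w+", transcript.lower())
    return [token for token in tokens if token in lexicon]


def needs_review(transcript, lexicon=FLAGGED_WORDS):
    """True if the transcript contains at least one word from the lexicon."""
    return bool(find_flagged_words(transcript, lexicon))
```

For example, `find_flagged_words("Badword! I forgot the right word.")` would return `["badword"]`. A lookup like this only works if the transcript keeps the word as it was spoken, which is why I think the words need to be in the dataset in the first place.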

Finally I see at least something about domains, thanks… Why is it so hard to find concrete information about them? On the page for adding sentences I see only a list of them, with no explanation of what they are used for, and in the guidelines they are just mentioned as the “theme of sentence” without any explanation of this “feature”. But I think my misunderstanding here is too big for this reply. I will open a separate topic for it.