Brackets in sentences

Hi! Why are the brackets “(” and “)” treated as special symbols, and why doesn’t Sentence Collector accept sentences that contain them? They are ordinary punctuation marks. Or am I wrong?
I noticed this when I wanted to add a sentence in Russian: «Некоторые люди слишком сильно беспокоятся о выборе ав (аватарок) на свои страницы».

Some personal notes - not answering your question, I know:

  1. It is done here (Russian rules, same as default):
    https://github.com/common-voice/sentence-collector/blob/85c1a455d27638c992d699d453b1cfd168ed089c/server/lib/validation/languages/ru.js#L19
  2. You can preprocess your text to change (abc) to -abc-, for example, to get around this (if you don’t change the above rule).
  3. “Wild guess” at the reason: sentences with () are mostly not conversational, and they disrupt the natural speech pattern of the person recording for the dataset. Most of them come from books etc., and they are mostly descriptive (e.g. explaining the previous word, giving an alternative, etc.). If this is the case, you might like to remove them altogether.

As I process the whole text corpus 2-3 times offline anyway, I use option 2 or 3.
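
For what it's worth, options 2 and 3 could look roughly like this in Python. This is only a sketch: the function names and the regex are mine (hypothetical), not taken from Sentence Collector, it only handles simple non-nested brackets, and I read option 3 as "drop the bracketed part", although dropping the whole sentence is just as valid.

```python
import re

# Assumption: one non-nested parenthesized group, e.g. "(abc)".
# This pattern is illustrative and is not the Sentence Collector rule.
PAREN_GROUP = re.compile(r"\(([^()]*)\)")


def dashes_for_parens(sentence: str) -> str:
    """Option 2: rewrite "(abc)" as "-abc-"."""
    return PAREN_GROUP.sub(r"-\1-", sentence)


def drop_paren_groups(sentence: str) -> str:
    """Option 3 (one reading): remove bracketed asides and tidy the spacing."""
    without = PAREN_GROUP.sub("", sentence)
    # Collapse the double space left behind by the removed group.
    without = re.sub(r"\s{2,}", " ", without)
    # Remove a space that now sits directly before punctuation.
    without = re.sub(r"\s+([.,;:!?])", r"\1", without)
    return without.strip()


def has_brackets(sentence: str) -> bool:
    """True if any bracket is left, e.g. an unbalanced or nested one."""
    return "(" in sentence or ")" in sentence
```

Anything for which `has_brackets` is still true after preprocessing (nested or unbalanced brackets) is probably best reviewed by hand or skipped.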


In your case, I’d change “av” to “avatars” and remove the parentheses (I’m going by Google Translate). Actually, “av” is not a word; it is an abbreviation and shouldn’t be there…

Some people worry too much about choosing av (avatars) for their pages

The main reason for disallowing parentheses was that some people recorded such sentences saying “in parentheses” or similar, and some recorded them without saying it.


Thank you for the rules. I know that I can remove the brackets; I was just surprised and wanted to ask why it is so.

Okay, thank you, I understand now :slight_smile:

In Russian, “av” (base form: ava /ru-page/) is not an abbreviation like IDK (I don’t know), but more like u (you). It doesn’t have variant readings because it’s a clipping.

I cannot comment on the validity, of course (I also thought it was a clipping).

Variant readings are the main reason why we don’t want to use abbreviations in Common Voice. In Russian, we can’t read clippings the way we read abbreviations. Clippings are just shorter versions of words. They aren’t abbreviations, and they usually have only one pronunciation, like normal words. Therefore I think that clippings are OK =)