The problem of the Russian letter "Ё"

Hi. Today I want to discuss one of the problems of Russian sentence collecting, namely the letter “Ё”. We don’t use it in everyday writing; we write “Е” instead. But should we use it here, in CV? I think we should, but I’d like to hear other opinions.

Why do I think so?

  1. They represent different sounds (Ё [jo], Е [je]), so some words are pronounced differently with and without “Ё”, and “Ё” is always a stressed vowel. For example, the words “Всё” ([Vs’o], “everything”) and “Все” ([Vs’e], “everyone”). If we don’t use “Ё”, both words have the same spelling. I don’t see a reason not to use different spellings for different pronunciations when Russian has forms for both of them.
  2. I don’t think a native speaker will confuse “Все” and “Всё” even if we use only “Все”, but for non-native speakers, having both forms can be useful.
  3. Russian has only one such letter, unlike Hebrew (niqqud) or Arabic (Arabic diacritics), which have whole systems of symbols for indicating pronunciation that are usually omitted. If somebody wants texts without “Ё”, converting “Ё” to “Е” is a much simpler task than the reverse: somebody who wants texts with “Ё” would have to check every text and fix each word by hand.

Right now the Russian part of CV has no rule for this, so I often see sentences both with and without “Ё”.

  • Всё, добрались наконец-то.
  • Здравствуй, мое предсмертное солнце, солнце Аустерлица!

If we use “Ё”, it should be:

  • Всё, добрались наконец-то.
  • Здравствуй, моё предсмертное солнце, солнце Аустерлица!

If we don’t use “Ё”, it should be:

  • Все, добрались наконец-то.
  • Здравствуй, мое предсмертное солнце, солнце Аустерлица!

Not «е and ё» — only «ё and ё» or «е and е». We should decide now, while we don’t yet have many audio clips and sentences in Russian.

I would like to discuss this with other native speakers here, but the Russian part of CV doesn’t exist on Discourse yet, and I’m not sure that other Russian contributors use Discourse.

Hey @Flay

Thanks for starting this discussion.

I would suggest reaching out to our language advisor @Fran and reposting this Discourse link in the community chat to draw in Russian speakers to engage with the topic.


Hi, thank you. How can I reach him, and which community chat do you mean?

Hi @Flay, thank you for your interest in Common Voice!

I don’t think it is necessary to include the ё /jo/ symbol; native speakers and proficient non-native speakers will know from context which sound to pronounce when they see е /je/. When training a model for Russian, I would collapse ё and е to make the output alphabet smaller.

If ё is included in the dataset, it is not a problem, as it can always be converted to е for training.
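That collapsing step is a one-line transform. A minimal sketch (illustrative only, not CV’s actual pipeline):

```python
def collapse_yo(text: str) -> str:
    """Collapse ё/Ё to е/Е so the output alphabet is one letter smaller."""
    return text.replace("ё", "е").replace("Ё", "Е")

print(collapse_yo("Всё, добрались наконец-то."))  # -> Все, добрались наконец-то.
```

Note that the transform is lossy: once collapsed, there is no mechanical way to tell «все» from «всё» again.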

Hi @ftyers

Thank you very much! I understand that users of the CV dataset can easily convert all sentences from “ё” to “е”, and that native speakers and proficient non-native speakers know from context how to pronounce words. But I’m talking about something slightly different.

I meant that I don’t see a reason not to use “ё”, because it helps to distinguish words with different pronunciations — not for the readers of sentences, but for the dataset and for voice technologies. And as you said above, if users of the dataset want to convert “ё” to “е”, it is an easy task, but adding “ё” back to the sentences is a much harder one.

But the main problem is the inconsistency in the sentences. Some contributors use “ё” and some don’t. I don’t think that’s good, which is why I raised this problem here. Unifying the sentences in the dataset would be very helpful, so I want to decide with the Russian-language community whether we should use “ё” always or never.

Hey @Flay, I think the problem you are emphasizing is not specific to a single letter or language. It happens in many languages, Turkish at least. I had to ask @ftyers about similar things last year.

Sometimes the “rules” get changed by a governing body, sometimes on their own, and if we use old texts (e.g. works from 70+ years ago that have entered the public domain), we cannot follow the current rules.

For example, in Turkish the use of “hats” (circumflexes) on the letters a, u, ı (=> â, û, î) flip-flopped every few years, and some editors use the current rules while others insist on the rules from before the 1980 coup. E.g. if we get rid of “â” and replace it with “a”, “hâlâ” (=> “still”) becomes “hala” (=> “aunt”). In addition, those accents need extra keystrokes on our keyboards, so many people have started to ignore them — which matters if you are collecting sentences from everyday people.
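The hâlâ/hala collision is exactly the same kind of lossy merge as ё/е. A quick sketch (the helper name is mine, just for illustration):

```python
def strip_hats(text: str) -> str:
    """Replace circumflexed Turkish vowels with their plain forms."""
    return text.translate(str.maketrans("âûîÂÛÎ", "auiAUI"))

# After stripping, two distinct words become indistinguishable:
print(strip_hats("hâlâ"))  # -> "hala", identical to "hala" ("aunt")
```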

I’m with you about setting ground rules for the alphabet, writing, and grammar, and publishing them (I did something similar in the Turkish sub-Discourse). What I basically emphasized is: “use current rules while generating sentences, but be flexible when validating”.

On the other hand, most validators will never see that info and will use whatever rules they think are right.

So the only option is post-release/pre-training normalization, as @ftyers emphasized. Also, for machine models, if the last layer’s alphabet is smaller, you generally get better results.


Yep, I don’t think it’s a big problem. It’s not really chaos. Also, if we change sentences that are already in the dataset, we lose the mapping to the recordings, because it isn’t possible to update sentences. It would be nice if that were a feature, but it is not.

So it’s up to the user of the dataset to decide, and to normalise one way or the other if it’s useful for their use case. If you have a particular use case or application in mind where you would definitely want ё, I can suggest some approaches for adding it to sentences that don’t have it.
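One such approach (a common one, though not necessarily what was meant here) is dictionary-based restoration: many ё-words are unambiguous, so a wordlist of е-spelling => ё-spelling pairs restores most cases, leaving genuinely ambiguous pairs like все/всё for context-based or manual handling. A minimal sketch with a tiny hand-made dictionary (a real tool would use a large wordlist):

```python
import re

# Tiny illustrative dictionary: е-spelling -> ё-spelling for unambiguous words.
YO_DICT = {
    "елка": "ёлка",
    "зеленый": "зелёный",
    "мое": "моё",
}

def restore_yo(text: str) -> str:
    def fix(match: "re.Match") -> str:
        word = match.group(0)
        repl = YO_DICT.get(word.lower())
        if repl is None:
            return word  # unknown or ambiguous word: leave as-is
        # Preserve capitalization of the first letter.
        return repl.capitalize() if word[0].isupper() else repl
    return re.sub(r"[А-Яа-яЁё]+", fix, text)

print(restore_yo("Здравствуй, мое предсмертное солнце"))
# -> Здравствуй, моё предсмертное солнце
```

The key design point is that ambiguous words (все/всё, осел/осёл) are deliberately absent from the dictionary, so the tool never guesses wrong on them — at the cost of leaving them unrestored.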


Sorry for diverting from the main topic…

Also, if we change the sentences that are in the dataset then we lose the mapping to the recordings because it isn’t possible to update sentences.

This might be impossible, as it would require post-moderation of the whole set — not feasible if the dataset is large. But it could be done gradually…

One thing we need is the ability to fix obviously wrong text, e.g. remnants of bad OCR that slip past our eyes during Sentence Collector validation, such as “rı => n” or “nın => mn”… I read every sentence 2-3 times before adding it to the Sentence Collector, and two more people validate it, but some errors still get through. I have a zero-error policy, which is hard to achieve.

The problem is: while recording, people correct these and say the right word, so the audio obviously does not match the transcription. I think we should be able to correct them (there is a related issue on GitHub).

It’s definitely possible and desirable, but it would need to be scoped out as a new feature on the platform. It might even be included as part of some other feature, so follow that issue on GitHub, or make a new issue and link it if that one isn’t quite right.