Russian speech


(Roman Frantov) #1

Hi there!
Let’s start contributing Russian speech to the project!
We need projects like this because of the complexity of the language. How can I help?


(Michael Henretty) #2

Hi Roman,

Thanks for your interest in this project!

Right now, beyond copying the code and setting up your own version of the Common Voice website with all the sentences translated, there is not much we have in place to help you. On the bright side, we plan on doing a big push for localization, and enabling more communities soon. Sadly, this won’t happen for another couple of months.

In the meantime, what you can do is look for a large collection of sentences in Russian that are part of the public domain. Once we have that, collecting voice samples is the easier part.


(Roman Frantov) #3

Hi Michael!
Thank you for your reply!

I will be grateful for any help! I want to set up a local instance of Common Voice and am ready to invest in translating sentences.

How can we start?

BR

Roman


(Michael Henretty) #4

Localizing and spinning up a new instance of the Common Voice website is the easy part. The hard part is finding a large collection of public domain Russian sentences for your visitors to read. I’d start by looking for this.

Translating the English sentences we use isn’t a good approach because we don’t have enough sentences yet, and we want the typical Russian way of talking (not a translated one).

Once we find a large collection of these Russian sentences we can use, I can guide you through getting a copy of Common Voice up and running.


(Roman Frantov) #5

You’re right. The languages are so different that we sometimes can’t translate a sentence directly; we need to alter the meaning a bit.
Where did you get your sentences?

I believe forums and news are not the best idea, right?

Books? Call center calls?

I have access to quality-monitoring recordings from an outsourcing contact center. But the speech there is quite boring :)


(Michael Henretty) #7

We have been collecting them from users lately, but in the past I have used public domain books and movies (War of the Worlds, It’s a Wonderful Life). The problem is that those texts are old and the language is outdated, so we are trying to get people to donate things they would actually say. It hasn’t been a very scalable approach yet, but I think it’s a good direction.
https://github.com/mozilla/voice-web/issues/341

As for forums and news, it really depends. Either can be pretty technical and use a lot of proper nouns, which is less useful I think. But the right forum might have some good stuff, you just gotta find it.

Books also have the “not very conversational” problem. Movie scripts are better. Call center calls are really good, that’s how Google trained their original engine.

As for your contact center recordings: are they public domain? License free? If so, this would be an amazing source, provided we could use it.


(Roman Frantov) #8

I found a great public domain collection of Russian sentences.
https://tatoeba.org/eng/
The whole DB can be downloaded here:

https://tatoeba.org/rus/terms_of_use
https://tatoeba.org/rus/downloads


(Michael Henretty) #9

That’s a great find @Roman_Frantov, and thanks for looking into this.

We have investigated Tatoeba in the past, and the problem is that their license, CC BY 2.0 (“CC-2”), is not compatible with our CC-0 license. That project is amazing though, so I will reach out to them and see if we can work something out.


(Roman Frantov) #10

Sorry, I can’t really tell the difference between CC-2 and CC-0 in this case. Can I help by communicating with them, or should we find another corpus?


(Michael Henretty) #11

Thank you @Roman_Frantov for the offer to help! Are you part of the Tatoeba community?

We are already talking to Trang, the founder of Tatoeba, about collaboration. The conversations so far are sounding very positive, and we may start some sort of collaboration in early 2018.


(Roman Frantov) #12

@mhenretty I joined the community recently. Good to hear that you’re moving on with them!


(Лариса) #13

Hi! This is an interesting part. How do we handle “kinda” and “dunno”? Another good case is “ain’t”, though it is different by nature. Anyway, this is the easiest part. Voice music is more challenging.


(Лариса) #14

My guess is that professional assistance is required, at least for the cases (nominative, dative, etc.) of numerals. They are indeed complicated, and some get distorted in colloquial language. Personally, I wouldn’t want to make those decisions myself.


(Nikita Mekh) #15

Any news about Russian speech? How can we help start the Russian localization?


(Luc Salommez) #16

Hi Nikita, before a language can be released, we need to gather written sentences for people to read aloud.

You can contribute written sentences on this website: https://voice-sprint.mozilla.community/
Even though it says “10-11th May”, the website still works and you can contribute there.

Once enough sentences have been gathered, the Russian language should be available for speech contribution.

Be aware though that the sentences you contribute must be free of copyright.

Thank you for contributing to this project :).


(Nikita Mekh) #17

Thanks. I will try to contribute as much as possible.
How can I check the current progress? How many sentences have already been collected for Russian? What is the goal? This information would be very useful for contributors.


(Luc Salommez) #18

As a contributor myself, I do not have access to that information.
In my opinion, as Russian is spoken by a lot of people it shouldn’t be too long before it is released.

Note though that you have to be on the Russian version of the Common Voice website to access the Russian contribution section for voice.