Russian speech

Roman_Frantov · August 30, 2017, 9:28am

Hi there!
Let’s start contributing Russian speech to the project!
We need projects like this because of the complexity of the language. How can I be of help?

mhenretty · September 25, 2017, 1:40pm

Hi Roman,

Thanks for your interest in this project!

Right now, beyond copying the code and setting up your own version of the Common Voice website with all the sentences translated, there is not much we have in place to help you. On the bright side, we plan on doing a big push for localization, and enabling more communities soon. Sadly, this won’t happen for another couple of months.

In the meantime, what you can do is look for a large collection of sentences in Russian that are part of the public domain. Once we have that, collecting voice samples is the easier part.

Roman_Frantov · September 26, 2017, 1:57pm

Hi Michael!
Thank you for your reply!

I will be grateful for any help! I want to set up a local instance of Common Voice and am ready to invest in translating sentences.

How can we start?

BR

Roman

mhenretty · September 26, 2017, 2:41pm

Localizing and spinning up a new instance of the Common Voice website is the easy part. The hard part is finding a large collection of public domain Russian sentences for your visitors to read. I’d start by looking for this.

Translating the English sentences we use isn’t a good approach because we don’t have enough sentences yet, and we want the typical Russian way of talking (not a translated Russian way of talking).

Once we find a large collection of these Russian sentences we can use, I can guide you through getting a copy of Common Voice up and running.

Roman_Frantov · September 26, 2017, 2:55pm

You’re right. The languages are so different that sometimes we can’t translate a sentence directly; we need to alter the meaning a bit.
Where did you get your sentences?

I believe forums and news are not the best idea, right?

Books? Call center calls?

I have access to quality-monitoring recordings from an outsourcing contact center. But the speech there is quite boring.

mhenretty · September 27, 2017, 9:16pm

We have been collecting them from users lately, but in the past I have used public domain books and movies (War of the Worlds, It’s a Wonderful Life). The problem is that those texts are old and the language is outdated, so we are trying to get people to donate stuff they would say. It hasn’t been a very scalable approach yet, but I think it’s a good direction.
https://github.com/mozilla/voice-web/issues/341

It really depends. Either can be pretty technical and use a lot of proper nouns, which is less useful I think. But the right forum might have some good stuff, you just gotta find it.

Books also have the “not very conversational” problem. Movie scripts are better. Call center calls are really good; that’s how Google trained their original engine.

Is it public domain? License free? If so, this would be an amazing source granted we could use it.

Roman_Frantov · September 28, 2017, 2:43pm

I found a great public domain collection of Russian sentences.
https://tatoeba.org/eng/
The whole DB can be downloaded here:

https://tatoeba.org/rus/terms_of_use
https://tatoeba.org/rus/downloads

mhenretty · September 29, 2017, 11:05am

That’s a great find @Roman_Frantov, and thanks for looking into this.

We have investigated Tatoeba in the past, and the problem is that their license, CC-2, is not compatible with our CC-0 license. That project is amazing though, so I will reach out to them and see if we can work something out.

Roman_Frantov · October 20, 2017, 2:40pm

Sorry, I can’t really tell the difference between CC-2 and CC-0 in this case. Can I help by communicating with them, or should we find another corpus?

mhenretty · October 23, 2017, 2:17pm

Thank you @Roman_Frantov for the offer to help! Are you part of the Tatoeba community?

We are already talking to Trang, the founder of Tatoeba, about collaboration. The conversations so far are sounding very positive, and we may start some sort of collaboration in early 2018.

Roman_Frantov · October 26, 2017, 10:14am

@mhenretty I joined the community recently. Good to hear that you’re moving on with them!

RLarissa · December 5, 2017, 10:15pm

Hi! This is an interesting part. How do we handle “kinda” and “dunno”? Another good case is “ain’t”, though it is different by nature. Anyway, this is the easiest part. Voice music is more challenging.

RLarissa · December 6, 2017, 4:03pm

My guess is that professional assistance is required, at least for the cases (nominative, dative, etc.) of numerals. They are indeed complicated, and some get distorted in colloquial language. Personally, I wouldn’t decide in this capacity.

nikita.mekh · June 11, 2018, 9:33am

Any news about Russian speech? How can we help to start Russian localization?

luc.salommez · June 11, 2018, 8:24pm

Hi Nikita, before a language can be released, written sentences need to be gathered for people to read aloud.

You can contribute written sentences on this website: https://voice-sprint.mozilla.community/
Even though it is written “10-11th May”, the website still works and you can contribute there.

Once enough sentences have been gathered, the Russian language should be available for speech contribution.

Be aware though that the sentences you contribute must be free of copyright.

Thank you for contributing to this project :).

nikita.mekh · June 15, 2018, 1:36pm

Thanks. I will try to contribute as much as possible.
How can I check the current progress? How many sentences have already been collected for Russian? What is the goal? This information would be very useful for contributors.

luc.salommez · June 15, 2018, 2:43pm

As a contributor myself I do not have access to that information.
In my opinion, as Russian is spoken by a lot of people it shouldn’t be too long before it is released.

Note though that you have to be on the Russian version of the Common Voice website to access the Russian contribution section for voice.

odinho · June 23, 2018, 9:43pm

Elsewhere it was said that more sentences is better, but these things will allow a language to be initially released for starting to gather some recordings:

the website fully translated on http://pontoon.mozilla.org/
at least 2000 sentences (and more in pipeline for review)
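
For anyone tracking the sentence requirement locally, here is a minimal sketch of a count check. The 2000 threshold comes from the list above; the function names and the idea of counting unique, non-empty lines are assumptions for illustration, not how the Common Voice team actually measures readiness.

```python
MIN_SENTENCES = 2000  # threshold quoted in this thread


def count_unique_sentences(lines):
    """Count distinct non-empty sentences, ignoring surrounding whitespace."""
    return len({line.strip() for line in lines if line.strip()})


def ready_for_launch(lines, minimum=MIN_SENTENCES):
    """Return True when the collection holds at least `minimum` unique sentences."""
    return count_unique_sentences(lines) >= minimum
```

Passing an open text file with one sentence per line (a hypothetical layout) to `ready_for_launch` would give a quick yes/no answer.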

Orless · August 6, 2018, 10:18am

I also want to push Russian forward.

It’s been a few weeks since the last reply. Could we please recap what we have to do to get started?

What I read so far:

Translate the website https://voice.mozilla.org using https://pontoon.mozilla.org/. Not quite sure what is meant here; https://voice.mozilla.org/ru seems to be translated.
Deliver 2000+ Russian sentences (public domain texts only) - in which form, how exactly?

Is this correct? Could someone from Mozilla please confirm this is the way to go and clarify how we could contribute texts or sentences?

There are many texts of classical Russian literature which are now in the public domain. The language there may not be the most modern, but I think this is a good way to start with something.

mhenretty · August 6, 2018, 4:14pm

The best way to contribute right now would be to find and review (or write) sentences in the public domain, and submit a PR to the main repo here: common-voice/server/data at master · common-voice/common-voice · GitHub
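
A rough first-pass review could be sketched like this before a human reads the sentences. Every rule here is an assumption for illustration (Cyrillic letters and basic punctuation only, which also rejects digits, a 14-word cap, no duplicates); it is not the project's actual validation logic, and the names are hypothetical.

```python
# Assumed punctuation whitelist; \u2013 and \u2014 are the en and em dash.
ALLOWED_PUNCT = set(" ,.!?:;-\u2013\u2014«»\"'")
MAX_WORDS = 14  # assumed length cap, not an official rule


def is_cyrillic_letter(ch):
    """True for Russian letters, including ё/Ё."""
    lower = ch.lower()
    return "а" <= lower <= "я" or lower == "ё"


def is_candidate(sentence, max_words=MAX_WORDS):
    """Cheap pre-filter; digits and Latin letters fail the character check."""
    text = sentence.strip()
    if not text or len(text.split()) > max_words:
        return False
    if not any(is_cyrillic_letter(ch) for ch in text):
        return False
    return all(is_cyrillic_letter(ch) or ch in ALLOWED_PUNCT for ch in text)


def filter_sentences(sentences):
    """Keep unique candidate sentences in their original order."""
    seen, kept = set(), []
    for raw in sentences:
        text = raw.strip()
        if is_candidate(text) and text not in seen:
            seen.add(text)
            kept.append(text)
    return kept
```

Rejecting digits outright sidesteps the numeral-case problem raised earlier in the thread, at the cost of dropping some otherwise usable sentences.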

Soon, though, we hope to have some better tools for reviewing and filtering bad sentences. See here for a discussion around that:
