Let’s start contributing Russian speech to the project!
We need projects like this due to the complexity of the language. How can I be of help?
Thanks for your interest in this project!
Right now, beyond copying the code and setting up your own version of the Common Voice website with all the sentences translated, there is not much we have in place to help you. On the bright side, we plan on doing a big push for localization, and enabling more communities soon. Sadly, this won’t happen for another couple of months.
In the meantime, what you can do is look for a large collection of sentences in Russian that are part of the public domain. Once we have that, collecting voice samples is the easier part.
Thank you for your reply!
I will be grateful for any help! I want to set up a local instance of Common Voice and am ready to invest in translating sentences.
How can we start?
Localizing and spinning up a new instance of the Common Voice website is the easy part. The hard part is finding a large collection of public domain Russian sentences for your visitors to read. I’d start by looking for this.
Translating the English sentences we use isn’t a good approach because we don’t have enough sentences yet, and we want the typical Russian way of talking (not a translated Russian way of talking).
Once we find a large collection of these Russian sentences we can use, I can guide you through getting a copy of Common Voice up and running.
You’re right. The languages are so different that we sometimes can’t translate a sentence directly; we need to alter the meaning a bit.
Where did you get your sentences?
I believe forums and news are not the best idea, right?
Books? Call center calls?
I have access to the quality-monitoring recordings of an outsourcing contact center, but the speech there is quite boring.
We have been collecting them from users lately, but in the past I have used public domain books and movies (War of the Worlds, It’s a Wonderful Life). The problem is that those texts are old and the language is outdated, so we are trying to get people to donate stuff they would say. It hasn’t been a very scalable approach yet, but I think it’s a good direction.
It really depends. Either can be pretty technical and use a lot of proper nouns, which is less useful I think. But the right forum might have some good stuff, you just gotta find it.
Books also have the “not very conversational” problem. Movie scripts are better. Call center calls are really good; that’s how Google trained their original engine.
Is it public domain? License free? If so, this would be an amazing source, provided we can use it.
I found a great public domain collection of Russian sentences.
The whole DB can be downloaded here
That’s a great find @Roman_Frantov, and thanks for looking into this.
We have investigated Tatoeba in the past, and the problem is that their license, CC-2, is not compatible with our CC-0 license. That project is amazing though, so I will reach out to them and see if we can work something out.
Sorry, I can’t really tell the difference between CC-2 and CC-0 in this case. Can I help by communicating with them, or should we find another corpus?
Thank you @Roman_Frantov for the offer to help! Are you part of the Tatoeba community?
We are already talking to Trang, the founder of Tatoeba, about collaboration. The conversations so far are sounding very positive, and we may start some sort of collaboration in early 2018.
@mhenretty I joined the community recently. Good to hear that you’re moving on with them!
Hi! This is an interesting part. How do we manage “kinda” and “dunno”? Another good case is “ain’t”, though it is different by nature. Anyway, this is the easiest part. Voice music is more challenging.
My guess is that professional assistance is required, at least for the cases (nominative, dative, etc.) of numerals. They are indeed complicated, and some are distorted in colloquial language. Personally, I wouldn’t make that decision myself.
Any news about Russian speech? How can we help to start Russian localization?
Hi Nikita, before a language can be released, written sentences must be gathered for people to read aloud.
You can contribute written sentences on this website: https://voice-sprint.mozilla.community/
Even though it is written “10-11th May”, the website still works and you can contribute there.
Once enough sentences have been gathered, the Russian language should be available for speech contribution.
Be aware though that the sentences you contribute must be free of copyright.
Thank you for contributing to this project :).
Thanks. I will try to contribute as much as possible.
How can I check the current progress? How many sentences have already been collected for Russian? What is the goal? This information would be very useful for contributors.
As a contributor myself, I do not have access to that information.
In my opinion, as Russian is spoken by a lot of people it shouldn’t be too long before it is released.
Note though that you have to be on the Russian version of the Common Voice website to access the Russian contribution section for voice.
Elsewhere it was said that the more sentences the better, but these two things allow a language to be released initially so that it can start gathering recordings:
- the website fully translated on http://pontoon.mozilla.org/
- at least 2000 sentences (and more in pipeline for review)
I also want to push Russian forward.
It’s been a few weeks since the last reply. Could we please recap what we have to do to get started?
What I read so far:
- Translate the website https://voice.mozilla.org using https://pontoon.mozilla.org/. Not quite sure what is meant here; https://voice.mozilla.org/ru seems to be translated already.
- Deliver 2000+ Russian sentences (public domain texts only) - in what form, and how exactly?
Is this correct? Could someone from Mozilla please confirm this is the way to go and clarify how we could contribute texts or sentences?
There are many texts of classical Russian literature which are now in the public domain. The language there may not be the most modern, but I think this is a good way to start with something.
The best way to contribute right now would be to find and review (or write) sentences in the public domain, and submit a PR to the main repo here: https://github.com/mozilla/voice-web/tree/master/server/data
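If it helps anyone preparing such a PR, here is a rough sketch of how a batch of candidate sentences could be tidied up before review. It is only an illustration, not an official tool: the function names, the 14-word limit, and the idea of skipping sentences with digits (numerals were flagged as tricky earlier in this thread) are my own assumptions, and the exact file format expected in the repo should be checked there. The 2000 figure is the minimum mentioned above.

```python
import re

# Hypothetical limit for illustration; the project's real acceptance rules may differ.
MAX_WORDS = 14
# Minimum number of sentences mentioned earlier in this thread.
MIN_SENTENCES = 2000


def clean_sentences(lines, max_words=MAX_WORDS):
    """Return whitespace-normalized, de-duplicated sentences in their original order."""
    seen = set()
    cleaned = []
    for line in lines:
        sentence = " ".join(line.split())  # trim and collapse runs of whitespace
        if not sentence:
            continue  # skip blank lines
        if re.search(r"\d", sentence):
            continue  # skip digits: readers may pronounce numerals differently
        if len(sentence.split()) > max_words:
            continue  # skip sentences that are too long to read comfortably
        key = sentence.casefold()  # case-insensitive duplicate check
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(sentence)
    return cleaned


def enough_for_release(sentences, minimum=MIN_SENTENCES):
    """True when the batch reaches the minimum size discussed in this thread."""
    return len(sentences) >= minimum


if __name__ == "__main__":
    sample = ["  Привет,   мир!  ", "привет, мир!", "", "У меня 3 кота.", "Как дела?"]
    batch = clean_sentences(sample)
    print(batch)
    print(enough_for_release(batch))
```

A script like this only removes obvious problems; every sentence still needs a human check for licensing and for natural-sounding Russian.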
Soon, though, we hope to have some better tools for reviewing and filtering bad sentences. See here for a discussion around that: