Let’s start contributing Russian speech to the project!
We need projects like this because of the complexity of the language. How can I be of help?
Thanks for your interest in this project!
Right now, beyond copying the code and setting up your own version of the Common Voice website with all the sentences translated, there is not much we have in place to help you. On the bright side, we plan on doing a big push for localization, and enabling more communities soon. Sadly, this won’t happen for another couple of months.
In the meantime, what you can do is look for a large collection of sentences in Russian that are part of the public domain. Once we have that, collecting voice samples is the easier part.
Thank you for your reply!
I will be grateful for any help! I want to set up a local instance of Common Voice and am ready to invest in translating sentences.
How can we start?
Localizing and spinning up a new instance of the Common Voice website is the easy part. The hard part is finding a large collection of public domain Russian sentences for your visitors to read. I’d start by looking for this.
Translating the English sentences we use isn’t a good approach because we don’t have enough sentences yet, and we want the typical Russian way of talking (not a translated Russian way of talking).
Once we find a large collection of these Russian sentences we can use, I can guide you through getting a copy of Common Voice up and running.
You’re right. The languages are so different that sometimes we can’t translate a sentence directly; we need to alter the meaning a bit.
Where did you get your sentences?
I believe forums and news are not the best idea, right?
Books? Call center calls?
I have access to quality-monitoring recordings from one outsourcing contact center, but the speech there is quite boring.
We have been collecting them from users lately, but in the past I have used public domain books and movies (War of the Worlds, It’s a Wonderful Life). The problem is that those texts are old and the language is outdated, so we are trying to get people to donate stuff they would say. It hasn’t been a very scalable approach yet, but I think it’s a good direction.
It really depends. Either can be pretty technical and use a lot of proper nouns, which is less useful I think. But the right forum might have some good stuff, you just gotta find it.
Books also have the “not very conversational” problem. Movie scripts are better. Call center calls are really good; that’s how Google trained their original engine.
Is it public domain? License free? If so, this would be an amazing source, provided we can use it.
I found a great public domain collection of Russian sentences.
The whole DB can be downloaded here
That’s a great find @Roman_Frantov, and thanks for looking into this.
We have investigated Tatoeba in the past, and the problem is that their license, CC-BY 2.0, is not compatible with our CC-0 license. That project is amazing though, so I will reach out to them and see if we can work something out.
Sorry, I can’t really tell the difference between CC-2 and CC-0 in this case. Can I help by communicating with them, or should we find another corpus?
Thank you @Roman_Frantov for the offer to help! Are you part of the Tatoeba community?
We are already talking to Trang, the founder of Tatoeba, about collaboration. The conversations so far are sounding very positive, and we may start some sort of collaboration in early 2018.
@mhenretty I joined the community recently. Good to hear that you’re moving on with them!
Hi! This is an interesting part. How do we handle “kinda” and “dunno”? Another good case is “ain’t”, though it is different in nature. Anyway, this is the easiest part. Voice music is more challenging.
My guess is that professional assistance is required, at least for the cases (nominative, dative, etc.) of numerals. They are indeed complicated, and some get distorted in colloquial language. Personally, I wouldn’t want to make that decision myself.
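To make the numerals point a bit more concrete, here is a rough sketch (not anything Common Voice actually runs) of how a candidate sentence list could be split so that sentences containing digits are set aside for a human to review. The function name `split_by_numerals` and the rule "any digit means needs review" are my own assumptions for illustration.

```python
import re

# Assumption: a sentence containing digits is "risky", because the reader
# has to choose the case form of the numeral on their own.
DIGITS = re.compile(r"\d")
# Assumption: a usable Russian sentence contains at least one Cyrillic letter.
CYRILLIC = re.compile(r"[А-Яа-яЁё]")


def split_by_numerals(sentences):
    """Return (plain, needs_review): sentences without and with digits."""
    plain, needs_review = [], []
    for s in sentences:
        s = s.strip()
        if not s or not CYRILLIC.search(s):
            continue  # skip empty lines and lines with no Cyrillic text
        if DIGITS.search(s):
            needs_review.append(s)
        else:
            plain.append(s)
    return plain, needs_review


if __name__ == "__main__":
    sample = ["Я живу в Москве.", "У меня 5 яблок.", "Hello there.", ""]
    plain, needs_review = split_by_numerals(sample)
    print(plain)
    print(needs_review)
```

Someone who knows the grammar well could then spell out the numerals in the `needs_review` pile by hand, or drop those sentences entirely.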