I found a great public domain collection of Russian sentences.
https://tatoeba.org/eng/
The whole DB can be downloaded here
https://tatoeba.org/rus/terms_of_use
https://tatoeba.org/rus/downloads
That’s a great find @Roman_Frantov, and thanks for looking into this.
We have investigated Tatoeba in the past, and the problem is that their license, CC-2, is not compatible with our CC-0 license. That project is amazing though, so I will reach out to them and see if we can work something out.
Sorry, I can’t really tell the difference between CC-2 and CC-0 in this case. Can I be of help communicating with them, or should we find another corpus?
Thank you @Roman_Frantov for the offer to help! Are you part of the Tatoeba community?
We are already talking to Trang, the founder of Tatoeba, about collaboration. The conversations so far are sounding very positive, and we may start some sort of collaboration in early 2018.
Hi! This is an interesting part. How do we handle “kinda” and “dunno”? Another good case is “ain’t”, though it is different in nature. Anyway, this is the easiest part. Voice music is more challenging.
My guess is that professional assistance is required, at least for the cases (nominative, dative, etc.) of numerals. They are indeed complicated, and some are distorted in colloquial language. Personally, I wouldn’t want to decide this myself.
Any news about Russian speech? How can we help to start Russian localization?
Hi Nikita, before a language can be released, we need to gather written sentences that people will be able to read aloud.
You can contribute written sentences on this website : https://voice-sprint.mozilla.community/
Even though it is written “10-11th May”, the website still works and you can contribute there.
Once enough sentences have been gathered, the Russian language should be available for speech contribution.
Be aware though that the sentences you contribute must be free of copyright.
Thank you for contributing to this project :).
Thanks. I will try to contribute as much as possible.
How can I check the current progress? How many sentences have already been collected for Russian? What is the goal? This information would be very useful for contributors.
As a contributor myself, I do not have access to that information.
In my opinion, as Russian is spoken by a lot of people it shouldn’t be too long before it is released.
Note though that you have to be on the Russian version of the Common Voice website to access the Russian contribution section for voice.
Elsewhere it was said that more sentences is better, but these things will allow a language to be initially released for starting to gather some recordings:
I also want to push Russian forward.
It’s been a few weeks since the last reply; could we please recap what we have to do to get started?
What I read so far:
Is this correct? Could someone from Mozilla please confirm this is the way to go and clarify how we could contribute texts or sentences?
There are many texts of classical Russian literature which are now in the public domain. The language there may not be the most modern, but I think this is a good way to start with something.
The best way to contribute right now would be to find and review (or write) sentences in the public domain, and submit a PR to the main repo here: https://github.com/mozilla/voice-web/tree/master/server/data
Soon, though, we hope to have some better tools for reviewing and filtering bad sentences. See here for a discussion around that:
FYI I’m starting to work on this:
My goal with this issue is to provide 500 sentences for starters. My approach is to translate/adapt German sentences I dictate anyway. I think this should give me a good overview of which sentences are suitable.
I’ve got the first 500 sentences:
Can we use transcripts from the Russian Duma? I can parse them from here http://transcript.duma.gov.ru/ and then split them into sentences.
I believe it’s PD according to article 1259 of Book IV of the Civil Code of the Russian Federation No. 230-FZ of December 18, 2006, which mentions “official documents of state government agencies and local government agencies of municipal formations, including laws, other legal texts, judicial decisions, other materials of legislative, administrative and judicial character, official documents of international organizations, as well as their official translations”.
We can also use other transcripts, from the Federation Council and regional parliaments for example.
It fits pretty well, because it’s live speech, and with it we can build a really big dataset.
I would check with a local legal expert to confirm this material is in the public domain. If that’s the case, yes.
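To make the "split into sentences" step concrete, here is a minimal sketch of what I have in mind. It assumes the transcript text has already been downloaded and stripped of markup; the function name `split_sentences` and the boundary rule are my own illustration, not an existing tool.

```python
import re

# Assumed boundary rule (illustrative only): a sentence ends at ., !, ? or …
# when followed by whitespace and an uppercase letter or an opening quote.
_SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?…])\s+(?=[«"A-ZА-ЯЁ])')


def split_sentences(text: str) -> list[str]:
    """Split plain transcript text into sentences with a simple regex rule."""
    # Collapse line breaks and repeated spaces left over from the source page.
    normalized = " ".join(text.split())
    if not normalized:
        return []
    return [part for part in _SENTENCE_BOUNDARY.split(normalized) if part]


if __name__ == "__main__":
    for sentence in split_sentences("Заседание открыто. Кто за?  Принято!"):
        print(sentence)
```

A rule this simple will split wrongly after abbreviations followed by a capital letter (for example "г. Москва"), so the output would still need human review.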
Finally, I found a source which we can use without questions. Oral History Foundation http://oralhistory.ru/ provides recordings of conversations with notable people. Files released on their website are under CC BY-SA 4.0, but they partnered with Wikimedia RU in 2014 and uploaded some of them to Wikimedia Commons under CC0 1.0 https://commons.wikimedia.org/wiki/Category:Oral_History_voice_samples . We can probably use the audio files as well, but for now I am going to parse the transcriptions and put them in Sentence Collector.
For another source of PD Russian-language text, we can try to use Voice of America content. It’s PD because it’s the work of US government employees, but it doesn’t fit as well because it is not live speech. They had radio broadcasts in Soviet and early Russian times which would fit better, but I can’t find the archives easily. Here is the VOA Russian-language website, by the way: https://www.golos-ameriki.ru .
And I found another big source of Russian sentences. United Nations documents are published under PD, and there is already a tagged corpus available: https://cms.unov.org/UNCorpus/ . I wrote a script that extracts only the procès-verbaux (transcript) records from the corpus and validates them using the same method sentence-collector uses, and got more than 300k unique sentences. Not all sentences are good, so they additionally need to be validated by a human. Should I upload all of them to sentence-collector?
It also can be useful for other United Nations official languages (Arabic, English, Spanish, French, Chinese).
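For anyone who wants to repeat this for another language, the filter-and-deduplicate step can be sketched roughly as below. This is not the actual script and not the real sentence-collector rules: the word limit, the allowed-character set, and the names `is_acceptable` and `unique_valid` are assumptions chosen for illustration.

```python
import re

# Assumed limit for illustration; not taken from sentence-collector.
MAX_WORDS = 14

# Assumed rule: only Cyrillic letters, spaces and basic punctuation
# (\u2014 is the long dash common in Russian text). Digits and Latin
# letters are rejected because they are ambiguous to read aloud.
_ALLOWED = re.compile(r"[А-Яа-яЁё ,.!?:;«»\-\u2014]+")


def is_acceptable(sentence: str) -> bool:
    """Return True if the sentence passes the assumed validation rules."""
    word_count = len(sentence.split())
    return 1 <= word_count <= MAX_WORDS and _ALLOWED.fullmatch(sentence) is not None


def unique_valid(sentences: list[str]) -> list[str]:
    """Normalize whitespace, drop invalid sentences and case-insensitive duplicates."""
    seen: set[str] = set()
    result: list[str] = []
    for raw in sentences:
        sentence = " ".join(raw.split())
        key = sentence.casefold()
        if key in seen or not is_acceptable(sentence):
            continue
        seen.add(key)
        result.append(sentence)
    return result
```

For example, passing in "Заседание открыто.", a differently cased copy of it, and "Резолюция 1325 принята." would keep only the first: the copy is a duplicate and the last one contains digits. The surviving sentences would still go through human review.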