Russian speech

FYI, I’m starting to work on this:

My goal with this issue is to provide 500 sentences for starters. My approach is to translate/adapt German sentences I dictate anyway. I think this should give me a good overview of which sentences are suitable.


I’ve got the first 500 sentences:

Can we use transcripts from the Russian Duma? I can parse them from http://transcript.duma.gov.ru/ and then split them into sentences.
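A minimal sketch of the "cut into sentences" step, assuming the transcript text has already been downloaded as plain text. The function name `split_sentences` and the boundary rule are my own illustration, not an existing tool, and the rule is deliberately naive:

```python
import re
from typing import List

# Naive rule (an assumption, not a full tokenizer): a sentence ends at
# ".", "!", "?" or "…" when it is followed by whitespace and then an
# uppercase letter, optionally preceded by an opening quote or a dash.
_SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?…])\s+(?=[«"\u2014-]?\s*[А-ЯЁA-Z])')


def split_sentences(text: str) -> List[str]:
    """Split plain transcript text into sentences."""
    # Collapse line breaks and repeated spaces left over from the HTML.
    normalized = " ".join(text.split())
    if not normalized:
        return []
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(normalized) if s.strip()]


if __name__ == "__main__":
    sample = "Уважаемые коллеги! Прошу включить режим голосования.\nСпасибо."
    for sentence in split_sentences(sample):
        print(sentence)
```

This will split incorrectly after abbreviations that are followed by a capitalised word (for example "г. Москва"), so the output would still need a manual pass or a proper sentence tokenizer.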

I believe it’s PD according to article 1259 of Book IV of the Civil Code of the Russian Federation No. 230-FZ of December 18, 2006, which mentions “official documents of state government agencies and local government agencies of municipal formations, including laws, other legal texts, judicial decisions, other materials of legislative, administrative and judicial character, official documents of international organizations, as well as their official translations”.

We can also use other transcripts, for example from the Federation Council and regional parliaments.

It fits pretty well because it’s live speech, and with it we can build a really big dataset.

I would check with a local legal expert to confirm this material is in the public domain. If that’s the case, yes :slight_smile:

Finally, I found a source that we can use without any questions. The Oral History Foundation http://oralhistory.ru/ provides recordings of conversations with notable people. Files released on their website are under CC BY-SA 4.0, but they partnered with Wikimedia RU in 2014 and uploaded some of them to Wikimedia Commons under CC0 1.0 https://commons.wikimedia.org/wiki/Category:Oral_History_voice_samples . We can probably use the audio files as well, but for now I am going to parse the transcriptions and put them in Sentence Collector.

For another source of PD Russian-language text, we can try Voice of America content. It’s PD because it’s the work of US government employees, but it doesn’t fit as well because it’s not live speech. They had radio broadcasts during the Soviet and early post-Soviet period, which would fit better, but I can’t find the archives easily. Here is the VOA Russian-language website, by the way: https://www.golos-ameriki.ru .

And I found another big source of Russian sentences. United Nations documents are published as PD, and there is already a tagged corpus available: https://cms.unov.org/UNCorpus/ . I wrote a script that extracts only the procès-verbaux (transcript) records from the corpus and validates them using the same method Sentence Collector uses, and got more than 300k unique sentences. Not all of the sentences are good, so they additionally need to be validated by a human. Should I upload all of them to Sentence Collector?
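For anyone who wants to try the same thing, here is a rough sketch of the validate-and-deduplicate step. It is not the script mentioned above, and the rules in it (the `MAX_WORDS` limit and the allowed character set) are placeholders I made up for illustration. They are not the Sentence Collector's actual rule set, so check them against the project's current rules before relying on the output:

```python
import re
from typing import Iterable, List

# Assumed limit, for illustration only.
MAX_WORDS = 14

# Assumed rule: only Cyrillic letters, whitespace and basic punctuation.
# This rejects digits, Latin abbreviations and symbols such as "/".
_ALLOWED = re.compile(r'^[А-Яа-яЁё\s.,!?:;«»"\u2014\u2013-]+$')


def is_valid(sentence: str) -> bool:
    """Return True if the sentence passes the illustrative rules above."""
    words = sentence.split()
    if not words or len(words) > MAX_WORDS:
        return False
    return bool(_ALLOWED.match(sentence))


def filter_unique(sentences: Iterable[str]) -> List[str]:
    """Keep valid sentences, dropping case-insensitive duplicates."""
    seen = set()
    result = []
    for raw in sentences:
        sentence = " ".join(raw.split())  # normalize whitespace
        key = sentence.lower()
        if key in seen or not is_valid(sentence):
            continue
        seen.add(key)
        result.append(sentence)
    return result


if __name__ == "__main__":
    sample = [
        "Заседание объявляется открытым.",
        "Заседание  объявляется открытым.",  # duplicate after normalization
        "Резолюция 1325 была принята.",       # contains digits
    ]
    print(filter_unique(sample))
```

Filtering like this only removes sentences that are obviously unsuitable; it does not replace the human review step.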

It could also be useful for the other United Nations official languages (Arabic, English, Spanish, French, Chinese).