[Legal] [Sentence extraction] Belarusian texts from euroradio.fm

Aliaksandr · May 18, 2021, 12:10pm

Hello!

I am working on creating a Belarusian text corpus for Common Voice.
We have received permission from Euroradio (https://euroradio.fm), a large Belarusian online media outlet, to use their texts under a CC0 licence for Common Voice.

Do we need to put this permission into a formal document for you? If so, what should it look like? Can you provide an example of such a document?
Could you also explain how the process of uploading their texts into Common Voice should be performed?
We can put all their texts into a single file with one sentence per line and upload this file to Sentence Collector.
Can we skip the process of validating sentences? All the texts published on Euroradio’s website are checked by professional Belarusian linguists, so there should be almost no mistakes and the text quality is good.
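
As an illustration of the "one sentence per line" file mentioned above, here is a minimal Python sketch. It is not Common Voice or Sentence Collector tooling: the function names `to_sentence_lines` and `write_sentence_file` are hypothetical, and the punctuation-based splitting rule is a naive assumption that will mis-split abbreviations, so real Belarusian text would likely need a proper sentence tokenizer.

```python
import re

# Naive assumption: a sentence ends at ., !, ? or … followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def to_sentence_lines(texts):
    """Return unique, whitespace-normalized sentences in first-seen order."""
    seen = set()
    lines = []
    for text in texts:
        for part in _SENTENCE_END.split(text):
            sentence = " ".join(part.split())  # collapse inner whitespace
            if sentence and sentence not in seen:
                seen.add(sentence)
                lines.append(sentence)
    return lines


def write_sentence_file(texts, path):
    """Write the sentences to a UTF-8 file, one sentence per line."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in to_sentence_lines(texts):
            f.write(sentence + "\n")
```

Calling `write_sentence_file(article_texts, "sentences.txt")` on a list of article bodies (both names are placeholders) would produce a deduplicated file with one sentence per line.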

Thanks!

Em.Lewis-Jong · June 22, 2021, 6:47am

Hello! I’m EM, the Product Lead for CV. So excited to hear about this! I just need to check in with legal about the corpus guidelines, and I promise to get back to you tomorrow on that. @heyhillary can help with questions 2 and 3.

heyhillary · June 22, 2021, 9:47am

Hello

Thanks for your questions.

For question 2:
Check out this guide on how to add bulk submissions https://github.com/common-voice/common-voice/blob/main/docs/SENTENCES.md#bulk-submission.

For question 3:
We ask that you validate a sample of the sentences. This post explains how the Europarl dataset of speeches from the European Parliament was validated: Using the Europarl Dataset with sentences from speeches from the European Parliament
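
For illustration, drawing a reproducible random sample of sentences for manual review could look like the sketch below. The function name `draw_review_sample`, the sample size, and the seed are all assumptions made for this example; the required sample size is not stated in this thread.

```python
import random


def draw_review_sample(sentences, sample_size, seed=0):
    """Pick up to sample_size sentences at random, without repeats.

    A fixed seed makes the selection reproducible, so reviewers can
    regenerate exactly the same sample later.
    """
    pool = list(sentences)
    rng = random.Random(seed)
    k = min(sample_size, len(pool))
    return rng.sample(pool, k)
```

The returned list can then be shared with reviewers, who check each sentence in the sample by hand.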

If you have any questions, we are happy to help.

Em.Lewis-Jong · June 24, 2021, 4:08pm

Hello! We are excited to get you started with these sentences! We can get you a template to share the details of the Euroradio agreement the week of July 6th, as our legal expert is going on leave right now. Thank you!

mytmpaccount2015 · July 8, 2021, 10:12am

Hi @Em.Lewis-Jong, any updates on this? If the legal team is available now, could you please share the agreement template with @Aliaksandr? While we’re in progress crawling sentences from the Euroradio website, we’d like to resolve the legal issues as well.

heyhillary · July 13, 2021, 9:54am

Hey, apologies for the delay. We are finalising the agreement template and should get back to you soon.

mytmpaccount2015 · July 14, 2021, 11:01am

Just in case, here is the pull request. Would be great if this bulk submission could be part of the next release, as the Belarusian community is running out of sentences again: the previous 70K dataset has mostly been exhausted by now.

mytmpaccount2015 · July 15, 2021, 2:43pm

Hi, I saw that unfortunately the Euroradio sentences didn’t make their way into yesterday’s release. Just wanted to emphasize that it is still very important for us to make this batch of sentences available for recording as early as possible. The Belarusian community is currently one of the most active on Common Voice: we went from zero to 270 recorded hours in under two months, and we’ve got strong support from major media in the country. To keep up the momentum, more sentences are needed, as the ~155K existing sentences (Wikipedia + fiction) have already been recorded by the contributors. We have an informal agreement with Euroradio; please correct me if my understanding is wrong, but it is the lack of a template and formal approval that blocks the Euroradio sentence submission from going into production.

@heyhillary @phire – please advise: is there any chance that the pull request could be merged ahead of schedule and rolled out to the website as soon as possible? For the Belarusian community, it would be a great motivator. Otherwise, new sentences and new voices will miss the dataset release at the end of July and shift to the next release in half a year, and activity will drop significantly.

daniel.abzakh · July 18, 2021, 6:39am

Could you share more of your experience? How did your team manage to pull this off?

I used a template in the past that was crafted by our lawyer; here is a post discussing it: Copyright waive form for sentence collection

I attached the template for you.
Of course, this still needs the green light from Mozilla @heyhillary @phire

Отказ-от-авторского-права.zip (9.8 KB)

heyhillary · July 19, 2021, 9:16am

Hey, I’m so sorry for the delay. The document has to go through some thorough checks before we can share it. Apologies for the impact on community organising.

mytmpaccount2015 · July 19, 2021, 2:01pm

Thank you @heyhillary, and no worries – we will wait until all the legal nuances are sorted out.

Thank you @daniel.abzakh for sharing the template.

Disclaimer: my only responsibility is preparing the sentences, so I’m not fully aware of all aspects of the process. @Aliaksandr_Sh, who leads the team, may be in a better position to advise on this.

Basically, there exists a non-profit entity, say.by, whose mission is to support and encourage everyday use of Belarusian, both in personal and professional communication. Within their partnership with a few local companies, say.by offers automated assessment of the employees’ level of proficiency in Belarusian. The test platform includes ASR functionality, which is currently at the proof-of-concept (PoC) stage. Common Voice is viewed as a convenient vehicle to collect voice data that would improve this component of the platform. After the Wikipedia export was merged, which unblocked Belarusian in Common Voice, say.by reached out to local media with a press release. Publishing the press release to a total audience of several hundred thousand people provided a strong initial inflow of contributors. A dedicated website was rolled out to provide guidance to new participants and track recording / validation progress for Belarusian. There are also promotional accounts on Instagram and Telegram, and a contributors’ chat.

One more point: since the 2020 events in Belarus, there has been a surge of participation in various non-governmental initiatives. Common Voice is no exception: while the Belarusian language is relatively neglected by the state, supporting one’s native tongue by contributing to Common Voice may be regarded as both a moral high ground and a safe way to express one’s views. This is another factor that drives contribution activity for Belarusian.

daniel.abzakh · July 19, 2021, 8:34pm

@mytmpaccount2015 @Aliaksandr_Sh
I appreciate your response!
I actually enjoyed reading your message, it’s very informative.

I shared this with our group responsible for the Abkhazian part of Common Voice.

Could we arrange a Skype/Zoom meeting between our teams? We think it would be very beneficial for us to hear more about your experience and how you organize all these efforts.

Sincerely,
Nart.

heyhillary · July 21, 2021, 3:09pm

Hey @mytmpaccount2015 and @Aliaksandr

Good news: the template is now ready for sharing. My deepest apologies for the wait; it had to go through an extensive review process.

Could you direct message me your email so that I can share the form for Euroradio?

Once you have the form, please let us know if Euroradio are happy with it. Please also share with us the name and contact details of the person signing the form on behalf of Euroradio.

mytmpaccount2015 · July 21, 2021, 3:29pm

Hi @heyhillary, great news! Just DM’ed the email.

heyhillary · July 21, 2021, 4:00pm

Thank you! Just emailed @Aliaksandr with the contract. Thanks for your patience.

daniel.abzakh · July 22, 2021, 10:55am

@heyhillary

I am also waiting for the template. Can I have access to it?

heyhillary · July 22, 2021, 11:16am

Hey Daniel,

Apologies for not initially following up with you.

Could you please send me your email via DM so that I can share the template?

Also, we will be hosting a how-to guide for the template on the refreshed Community Playbook so everyone can access this process.

daniel.abzakh · July 22, 2021, 12:10pm

Hello Hillary,

I just sent you my email via dm.

Thank you,
Daniel.

heyhillary · July 22, 2021, 1:40pm

Hey Daniel,

Cool, just emailed you.

Kind regards,

Hillary

Aliaksandr_Sh · July 29, 2021, 8:30pm

Hi Daniel,

Apologies for the late reply. You can reach me on Telegram @LordOfTheBoards so we can arrange the call.

Basically, everything that @mytmpaccount2015 has mentioned is true. A few other details:

The messaging to the audience was just the right one: “the future of Belarusan language is in your voice”.
The Common Voice website is incredibly user-friendly (apart from a few bugs with voice recordings).

This was great teamwork by the Belarusan community, and it has had a great social effect in Belarus.
