I am working on creating a Belarusian text corpus for Common Voice.
We have received permission from Euroradio (https://euroradio.fm), a large Belarusian online media outlet, to use their texts under a CC0 licence for Common Voice.
Do we need to put this permission into a formal document for you? If so, what should it look like? Can you provide an example of such a document?
Could you also explain how their texts should be uploaded into Common Voice?
We can put all their texts into a single file with one sentence per line and upload this file to Sentence Collector.
Can we skip the process of validating sentences? All the texts published on Euroradio's website are checked by professional Belarusian linguists, so there should be almost no mistakes and the text quality is good.
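For reference, the single-file idea above (one sentence per line) could be prepared roughly as follows. This is only a sketch: the function names and the cleaning steps (collapsing whitespace, dropping empty lines and exact duplicates) are assumptions for illustration, not a description of what Sentence Collector requires.

```python
def prepare_sentences(raw_sentences):
    """Collapse whitespace, drop empty entries, and remove exact duplicates.

    First-seen order is kept. These cleaning steps are illustrative
    assumptions, not Sentence Collector rules.
    """
    seen = set()
    cleaned = []
    for sentence in raw_sentences:
        # split() + join() also removes embedded newlines and tabs
        text = " ".join(sentence.split())
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


def write_sentence_file(sentences, path):
    """Write one sentence per line as UTF-8; return the number of lines written."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for sentence in sentences:
            handle.write(sentence + "\n")
    return len(sentences)
```

Calling `write_sentence_file(prepare_sentences(sentences), "euroradio.txt")` (the file name is hypothetical) would then give a plain UTF-8 text file with one sentence per line.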
Hello! I'm EM, the Product Lead for CV. So excited to hear about this! I just needed to check in with legal about the corpus guidelines. I promise to get back to you tomorrow with that - @heyhillary can help with 2 + 3.
heyhillary
(Hillary (Community Manager, Common Voice))
Hello! We are excited to get you started with these sentences! We can get you a template for recording the details of the Euroradio agreement in the week of July 6th, as our legal expert is going on leave right now. Thank you.
Hi @Em.Lewis-Jong, any updates on this? If the legal team is available now, could you please share the agreement template with @Aliaksandr? While we're in progress crawling sentences from the Euroradio website, we'd like to resolve the legal issues as well.
heyhillary
(Hillary (Community Manager, Common Voice))
Hey, apologies for the delay. We are finalising the agreement template and should get back to you soon.
Just in case, here is the pull request. Would be great if this bulk submission could be part of the next release, as the Belarusian community is running out of sentences again: the previous 70K dataset has mostly been exhausted by now.
Hi, I saw that unfortunately the Euroradio sentences didn't make their way into yesterday's release. Just wanted to emphasize that it is still very important for us to make this batch of sentences available for recording as early as possible. The Belarusian community is currently one of the most active on Common Voice: we went from zero to 270 recorded hours in under two months, and we've got strong support from major media in the country. To keep up the momentum, more sentences are needed, as ~155K existing sentences (Wikipedia + fiction) have already been recorded by the contributors. We have an informal agreement with Euroradio; please correct me if my understanding is wrong, but it's the lack of a template and formal approval that blocks the Euroradio sentence submission from going into production.
@heyhillary @phire, please advise: is there any chance that the pull request could be merged ahead of schedule and rolled out to the website asap? For the Belarusian community, it would be a great motivator. Otherwise, new sentences and new voices will miss the dataset release at the end of July and shift to the next release in half a year, and activity will drop significantly.
heyhillary
(Hillary (Community Manager, Common Voice))
Hey, I'm so sorry for the delay. The document has to go through some thorough checks before we can share it. Apologies for the impact on the community organising.
Disclaimer: my only responsibility is preparing the sentences, so I'm not fully aware of all aspects of the process. @Aliaksandr_Sh, who leads the team, may be in a better position to advise on this.
Basically, there exists a non-profit entity, say.by, whose mission is to support and encourage everyday use of Belarusian, both in personal and professional communication. Within their partnership with a few local companies, say.by offer automated assessment of the employees' level of proficiency in Belarusian. The test platform includes ASR functionality, which is currently in PoC stage. Common Voice is viewed as a convenient vehicle to collect voice data that would improve this component of the platform. After the Wikipedia export was merged, which unblocked Belarusian in Common Voice, say.by reached out to local media with a press release. Publishing the press release to a total audience of several hundred thousand people provided a strong initial inflow of contributors. A dedicated website was rolled out to provide guidance to new participants and track recording / validation progress for Belarusian. There are also promotional accounts on Instagram and Telegram, and a contributors' chat.
One more point: since the 2020 events in Belarus, there has been a surge of participation in various non-governmental initiatives. Common Voice is no exception: while the Belarusian language is relatively neglected by the state, supporting one's native tongue by contributing to Common Voice may be regarded as both a moral high ground and a safe way to express one's views. This is another factor that drives contribution activity for Belarusian.
@mytmpaccount2015 @Aliaksandr_Sh
I appreciate your response!
I actually enjoyed reading your message; it's very informative.
I shared this with our group responsible for the Abkhazian part of Common Voice.
Could we arrange a Skype/Zoom meeting between our teams? We think it would be very beneficial for us to hear more about your experience and how you organize all these efforts.
Sincerely,
Nart.
heyhillary
(Hillary (Community Manager, Common Voice))
The good news: the template is now ready for sharing. My deepest apologies for the wait; it had to go through an extensive review process.
Could you direct message me your email address so that I can share the form for Euroradio?
Once you have the form, please let us know if Euroradio are happy with it. Please also share the name and contact details of the person at Euroradio who will sign the form.