[Legal] [Sentence extraction] Belarusian texts from euroradio.fm

Hello!

I am working on creating a Belarusian text corpus for Common Voice.
We have received permission from Euroradio (https://euroradio.fm), a large Belarusian online media outlet, to use their texts under the CC0 licence for Common Voice.

  1. Do we need to put this permission into a formal document for you? If so, what should it look like? Can you provide an example of such a document?

  2. Can you please also explain how their texts should be uploaded to Common Voice?
    We can put all their texts into a single file with one sentence per line and upload this file to Sentence Collector.

  3. Can we skip the sentence validation step? All texts published on Euroradio’s website are checked by professional Belarusian linguists, so there should be almost no mistakes and the text quality is good.
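
For point 2, here is a minimal sketch of how the single file could be prepared. It assumes the sentences have already been extracted into a Python list; the function name `write_sentences` and the output file name are just illustrations, and the exact format expected by Common Voice may differ from this.

```python
from pathlib import Path


def write_sentences(sentences, out_path):
    """Write unique, non-empty sentences to out_path, one per line (UTF-8)."""
    seen = set()
    kept = []
    for sentence in sentences:
        # Collapse internal whitespace and newlines so each sentence fits on one line.
        line = " ".join(sentence.split())
        if line and line not in seen:
            seen.add(line)
            kept.append(line)
    Path(out_path).write_text("".join(f"{line}\n" for line in kept), encoding="utf-8")
    return kept


if __name__ == "__main__":
    # Hypothetical input; real sentences would come from the Euroradio texts.
    write_sentences(["Першы сказ.", "Другі сказ."], "euroradio_sentences.txt")
```

Duplicates and empty lines are dropped here only as a convenience; that is an assumption on our side, not a stated requirement.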

Thanks!

Hello! I’m EM, the Product Lead for CV. So excited to hear about this! I just needed to check in with legal about the corpus guidelines. I promise to get back to you tomorrow with that :slight_smile: - @heyhillary can help with 2 + 3.

Hello :wave:t6:

Thanks for your questions.

For question 2:
Check out this guide on how to add bulk submissions: https://github.com/common-voice/common-voice/blob/main/docs/SENTENCES.md#bulk-submission

For question 3:
We ask that you validate a sample of the sentences. This post explains how the Europarl dataset of European Parliament speeches was validated: Using the Europarl Dataset with sentences from speeches from the European Parliament
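
As a rough illustration, drawing a random sample for manual review could look like the sketch below. The function name, the sample size and the seed are assumptions for the example only, not Common Voice requirements; the appropriate sample size depends on the size of your corpus.

```python
import random


def sample_for_review(sentences, sample_size, seed=0):
    """Return a reproducible random sample of sentences for manual validation."""
    sentences = list(sentences)
    if sample_size >= len(sentences):
        # Corpus is smaller than the requested sample: review everything.
        return sentences
    rng = random.Random(seed)  # fixed seed so reviewers can recreate the same sample
    return rng.sample(sentences, sample_size)


if __name__ == "__main__":
    # Assumes a file with one sentence per line; the file name is hypothetical.
    with open("euroradio_sentences.txt", encoding="utf-8") as f:
        corpus = [line.strip() for line in f if line.strip()]
    for sentence in sample_for_review(corpus, sample_size=500):
        print(sentence)
```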

If you have any questions, we are happy to help.