Hi, I’ve recently got started with the sentence collector (I’m reviewing Belarusian sentences), and I have a few questions: (1) The review tool loads the first 10K sentences that I haven’t previously reviewed, ordered by upload time. However, in total there are ~100K Belarusian sentences ava…

I’ll let @mkohler reply to the technical questions about the sentence collector, but we want to avoid people using automated processes to approve sentences; if detected, this can result in all sentences being invalidated. mytmpaccount2015: (4) Tatoeba has several thousand Belarusian se…

@nukeador The latest dump of Tatoeba’s Belarusian CC0 sentences has ~2K sentences. This is a lower-bound estimate, because the most prolific contributor of Belarusian data on Tatoeba has declared all his original contributions to be CC0, regardless of whether they are marked as such (Tatoeba’s CC0 licensing …

mytmpaccount2015: I’ll also take a look at the sentence extractor – maybe it wouldn’t require much effort to scrape the Belarusian Wikipedia. I’m concerned about the license though – iirc, Wikipedia’s textual content is CC-BY-SA. If that was discussed before, I would appreciate if you po…

If the sentences from anibel.net are generally no good, and you are unsure about licensing, I’d just post in “Sentence collector copyright issues” and have them deleted completely straight away; if they are generally more or less grammatically fine and acceptable, I’d try to inquire about the licensi…

I absolutely agree with most of what @Adrijaned said. Except: Adrijaned: Editing already validated sentences is definitely possible, but should be done only after review by at least one other native speaker, and preferably in an official capacity. We can find a way to do that, but doing…

Thanks @Adrijaned and @mkohler! To sort out the issue with those doubtfully licensed sentences, I’ll contact the contributor who added them, and later I’m going to focus on scraping the Belarusian Wikipedia. Still hoping that in the future we will also be able to import CC0 data from Tatoeba, althou…

Update: I contacted wiedymi0. He says the Belarusian texts available on anibel.net are translations from either English or Russian, and he doesn’t confirm that the originals are CC0. @mkohler, @Adrijaned – please decide if these sentences should be kept or removed.

I’m removing them, as we can’t confirm that those are indeed CC0.

Sentence collection for Belarusian

Common Voice

nukeador (Rubén Martín [❌ taking a break from Mozilla]) July 7, 2020, 12:03pm 4

We have a special legal approval for it, and it’s the first and main place we direct people to for sentences (the most productive in terms of time investment). See the conditions and process over here:
