Sentence collection for Belarusian

mytmpaccount2015 · July 7, 2020, 7:07am

Hi,

I’ve recently got started with the sentence collector (I’m reviewing Belarusian sentences), and I have a few questions:

(1) The review tool loads the first 10K sentences that I haven’t previously reviewed, ordered by upload time. However, there are ~100K Belarusian sentences available for review in total. To reach the rest, I would have to upvote or downvote each of the first 10K sentences, which is not particularly efficient: there is a lot of noise, and few sentences are good. I found a way to download all sentences locally and then send upvotes programmatically (with kinto-http, mimicking the logic in sentences-meta.js) to the sentence IDs of my choice. Is that something I’m allowed to do, or should we instead fix the web UI by adding pagination beyond 10K and sorting / filtering options?

(2) Some of the Belarusian sentences that have already been approved have formatting issues, most prominently the Latin i instead of the Belarusian і (U+0456). Again, I’m able to edit the sentences programmatically inside the Kinto collection (example), but I need to know whether that is legitimate or whether the preferred process is different.

(3) There also exist approved sentences that look bad to me, e.g. those containing dialectal or archaic words that are no longer used in modern standard Belarusian (Фаэтон гоцаў і гутаўся; Ай, каб не дарагоўля — памякчэў ба народ), or those which are not full sentences but rather nominal phrases (Спыненне дзеяння пасведчання аб дзяржаўнай рэгiстрацыi; Выхадныя звесткi друкаваных выданняў). These items cannot be unapproved, as they already have 2 upvotes. Is there anything that can be done about them?

(4) Tatoeba has several thousand Belarusian sentences under CC0, and most of them fit the criteria (i.e. short enough, no digits, etc.). Are there any plans to import CC0 data from Tatoeba in a centralized manner, or is it allowed to upload the CC0 sentences of my choice from Tatoeba into the sentence collector?
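For question (2), here is a minimal sketch of how the Latin i could be found and repaired in a local text dump. It works on plain strings only and does not touch the Kinto collection. The function name `fix_latin_i` and the choice to treat any word containing a character from the basic Cyrillic block (U+0400 to U+04FF) as a Belarusian word are assumptions made for illustration.

```python
import re

# Map Latin i/I to the Belarusian Cyrillic і (U+0456) / І (U+0406).
_TABLE = str.maketrans({"i": "\u0456", "I": "\u0406"})

_WORD = re.compile(r"\w+")
# Assumption: one character from the basic Cyrillic block marks a Cyrillic word.
_HAS_CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def fix_latin_i(sentence: str) -> str:
    """Replace Latin i/I only inside words that also contain Cyrillic letters."""

    def repair(match: re.Match) -> str:
        word = match.group(0)
        if _HAS_CYRILLIC.search(word):
            return word.translate(_TABLE)
        return word  # purely Latin words (e.g. "wiki") are left alone

    return _WORD.sub(repair, sentence)


if __name__ == "__main__":
    broken = "звестк" + "i"  # the final letter is a Latin i
    print(fix_latin_i(broken) == "звестк" + "\u0456")
```

A Latin i standing alone as a one-letter word is not caught by this check, because that word contains no Cyrillic letter; such cases would still need a manual look.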

Thanks in advance for any comments.

nukeador · July 7, 2020, 10:28am

I’ll let @mkohler reply to the technical questions about the sentence collector, but we want to avoid people using automated processes to approve sentences. If detected, this can result in all sentences being invalidated.

How many sentences are we talking about? We have a separate process for existing corpora of over 100K sentences, and our main recommendation is to first look into other large sources (instead of using the sentence collector for individual sentences).

mytmpaccount2015 · July 7, 2020, 11:58am

@nukeador The latest dump of Tatoeba’s Belarusian CC0 sentences has ~2K sentences. This is a lower bound estimate, because the most prolific contributor of Belarusian data on Tatoeba has declared all his original contributions to be CC0, no matter if they are marked as such (Tatoeba’s CC0 licensing is a new feature, and there is no easy way to apply it retroactively, only sentence by sentence). So more realistically we’re talking about ~5–7K sentences.

For the Belarusian sentences uploaded so far into the sentence collector, I did a breakdown by source:

83590 sentences from anibel.net (anime and manga localized to Belarusian). The site doesn’t seem to indicate clearly that its content is CC0 – maybe the contributor, wiedymi0, could clarify that.
17645 sentences from texts by several Belarusian authors who died more than 75 years ago, so their works are in the public domain. Sentence splitting and formatting are rather noisy in these sentences.
153 sentences from one of the state laws, also in the public domain.

Given these counts, I think it might be helpful to make use of Tatoeba’s Belarusian CC0 data, which are generally cleaner.

I’ll also take a look at the sentence extractor – maybe it wouldn’t require much effort to scrape the Belarusian Wikipedia. I’m concerned about the license though – iirc, Wikipedia’s textual content is CC-BY-SA. If that was discussed before, I would appreciate it if you could point me to the relevant discussion.

nukeador · July 7, 2020, 12:03pm

We have special legal approval for it, and it’s the first and main place we direct people to for sentences (the most productive in terms of time investment). See the conditions and process over here:

Adrijaned · July 7, 2020, 1:30pm

If the sentences from anibel.net are generally no good, and you are unsure about licensing, I’d just post in Sentence collector copyright issues and have them deleted straight away completely; if they are generally more or less grammatically fine and acceptable, I’d try to inquire about the licensing a bit more, and only if CC0 can’t be confirmed, I’d have them deleted.

Having archaic and dialectal words in the dataset is not an issue IMO as long as most people are capable of pronouncing them fine - I for one use such words a lot in my everyday speech, and so would find an STT ML model that doesn’t recognize them… lacking, to say the least.

Only 10k sentences being loaded is a technical limitation of Kinto, and while it has its issues, it’s also for the better - if we take a wild guess and say one sentence is about 100 bytes of JSON, loading all 100K sentences would mean downloading ~10MB of data on each page load - acceptable, although not great, on desktop PCs, and much worse on mobile devices connected to mobile networks, for example when you want to review a few sentences while riding a train.
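The back-of-the-envelope estimate above, written out as a small sketch. The 100 bytes per sentence is the wild guess from this post, not a measurement, and the function name `estimated_payload_mb` is only illustrative.

```python
def estimated_payload_mb(sentence_count: int, bytes_per_sentence: int = 100) -> float:
    """Rough download size in MB (1 MB = 1,000,000 bytes) for one page load."""
    return sentence_count * bytes_per_sentence / 1_000_000


if __name__ == "__main__":
    print(estimated_payload_mb(10_000))   # current 10K limit -> 1.0
    print(estimated_payload_mb(100_000))  # all ~100K sentences -> 10.0
```

Under that guess, the current 10K limit keeps each page load at roughly 1 MB instead of 10 MB.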

Editing already validated sentences is definitely possible, but should be done only after review by at least one other native speaker, and preferably in an official capacity.

mkohler · July 7, 2020, 2:36pm

I absolutely agree with most of what @Adrijaned said. Except:

We can find a way to do that, but doing it yourself through kinto libs is not the way to go. While this is currently possible, this won’t be possible for long. There are infrastructure changes on the way which won’t allow that anymore. And some bugs such as the mentioned 10k limit will be fixed with that as well.

Happy to delete sentences that are not CC0, you can either write it in the topic @Adrijaned mentioned, or here once we know if it’s a violation or not (though it probably is).

mytmpaccount2015 · July 7, 2020, 3:03pm

Thanks @Adrijaned and @mkohler! To sort out the issue with those doubtfully-licensed sentences, I’ll contact the contributor who added them, and later I’m going to focus on scraping the Belarusian Wikipedia. Still hoping that in the future we will also be able to import CC0 data from Tatoeba, although I understand that it is less relevant at the moment.

mytmpaccount2015 · July 7, 2020, 5:26pm

Update: I contacted wiedymi0, he says the Belarusian texts available on anibel.net are translations from either English or Russian, and doesn’t confirm the originals to be CC0. @mkohler, @Adrijaned – please decide if these sentences should be kept or removed.

mkohler · July 7, 2020, 5:48pm

I’m removing them, as we can’t confirm that those are indeed CC0.