Sentence collection for Belarusian – request for advice

mytmpaccount2015 · June 24, 2021, 8:31pm

We’d like to ask for your advice on collecting additional sentences for Belarusian. Since @Aliaksandr finalized the pull request and a local team of enthusiasts launched a volunteering campaign, the number of Belarusian clips has been growing at a steady rate of more than 10K per day. In a few days, each of the 85K sentences in the Belarusian Wikipedia extraction dump will likely have been recorded at least once (assuming the least-recorded sentences come first in the queue). Exhausting the supply of sentences isn’t a problem in itself, as the robustness of the ASR system would improve with many recordings per sentence in the training data. However, we’re concerned about lexical and grammatical diversity: the Wikipedia data don’t cover a range of important phenomena, such as interrogative and imperative sentences, colloquialisms, etc. The volunteers may also get bored after a while if the dataset is not expanded.

Let me briefly describe the current situation with Belarusian sentence collection:

Apart from Wikipedia, we’re not aware of any other sources of CC0-licensed Belarusian texts that would be large enough for bulk import via the sentence extractor.
There is some work in progress on importing sentences from the media portal Euroradio, but the legal agreement ensuring CC0 will be ready no earlier than July 6th (details here).
In the sentence collector, there are ~18K Belarusian sentences that haven’t yet been validated, mostly from fiction books written in the first half of the 20th century (and therefore in the public domain). Many of them are noisy: there are OCR errors (such as Latin “i” instead of Belarusian “і”), sentence splitting issues, unusual proper names, words no longer used in modern standard Belarusian, etc. Reviewing these sentences manually wouldn’t be particularly effective, as most sentences would be downvoted.
We can quickly prepare a cleaner sample of sentences from old fiction books in Belarusian available at knihi.com.

Please advise:

Do you think we should focus on reviewing the sentences which are currently in the sentence collector, or making a cleaner sample?
If the former: Is it at least possible to replace Latin “i” (U+0069 lowercase, U+0049 uppercase) with Belarusian “і” (U+0456 lowercase, U+0406 uppercase) everywhere in the sentences for review?
If the latter: Should we still upload the new sample into the sentence collector and then review one by one? Or should we follow the bulk upload procedure described here, i.e. review a subset of sentences and then send a PR?
Is my understanding correct that the next export from the sentence collector is scheduled for June 30th?
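
For illustration, the Latin-to-Cyrillic replacement asked about above could be sketched in Python as follows. This is only a sketch under stated assumptions, not how the sentence collector actually processes its backlog: the function name `fix_latin_i` is hypothetical, and the rule of touching only words that contain Cyrillic letters (plus a standalone “i”, the conjunction, when the sentence is otherwise Cyrillic) is a conservative choice so that genuinely Latin words are left alone.

```python
import re

# Latin "i"/"I" (U+0069, U+0049) mapped to Belarusian "і"/"І" (U+0456, U+0406).
LATIN_TO_CYRILLIC = str.maketrans({"i": "\u0456", "I": "\u0406"})

# Any character from the basic Cyrillic Unicode block.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")
# A run of letters (no digits, no underscore).
WORD = re.compile(r"[^\W\d_]+")


def fix_latin_i(sentence: str) -> str:
    """Replace Latin i/I with the Cyrillic look-alike inside Cyrillic text.

    Assumption (not from the thread): words written entirely in Latin
    script are kept unchanged, except a standalone "i"/"I" in a sentence
    that otherwise contains Cyrillic letters.
    """
    sentence_has_cyrillic = bool(CYRILLIC.search(sentence))

    def fix_word(match: re.Match) -> str:
        word = match.group(0)
        if CYRILLIC.search(word):
            return word.translate(LATIN_TO_CYRILLIC)
        if word in ("i", "I") and sentence_has_cyrillic:
            return word.translate(LATIN_TO_CYRILLIC)
        return word

    return WORD.sub(fix_word, sentence)
```

Replacing the character literally everywhere, as the question suggests, would reduce to `sentence.translate(LATIN_TO_CYRILLIC)`; the word-level check above only matters if the backlog contains intentional Latin-script words.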

Thanks in advance for any comments. We’re really interested in adding more Belarusian sentences asap, and we would appreciate your guidance.

mkohler · June 24, 2021, 8:43pm

That really depends on the actual quality. Can’t judge that really. What’s the rough ratio of sentences that are useful?

Sure. Anything else? I’m also happy to delete sentences if you know of a pattern that is easy to detect and would help a lot.

Depends on the size: common-voice/docs/SENTENCES.md at main · common-voice/common-voice · GitHub

Yes, weekly on Wednesdays.

mytmpaccount2015 · June 26, 2021, 7:18am

Thank you @mkohler for these recommendations. At the moment, we have ~80K new sentences exported from public domain Belarusian fiction; this dataset has some overlap with the backlog of Belarusian sentence collector (as it basically covers a wider range of texts by the same authors), but the sentences are much cleaner now. The export procedure is described here. We haven’t started validating sample sentences yet. Could you please let us know if there’s still any chance to complete this bulk submission before the next release of sentences? E.g. if we validate a 4K sample by Monday, June 28th, and the error rate is reasonably low, could we expect that the PR is accepted before the upcoming release?

mkohler · June 26, 2021, 11:01am

I can’t speak on @phire’s behalf, so I can’t guarantee that. The next Common Voice release happens on June 30th, so it would need to be merged by then. What I can say, however, is that the earlier the PR is submitted, the better the chances that there is enough time for it to be merged before that deadline.

Happy to see that extract-file from the Sentence Extractor is useful! This might also be interesting for others, do you want to create a new topic here on Discourse to post this?

mytmpaccount2015 · June 26, 2021, 2:12pm

Thanks! The verification is now in progress; from what I’ve seen so far, most sentences look fine, so I would expect the error rate to be below the threshold. I’m going to prepare the pull request later today, mark it as WIP and then update it once the QA is complete.

Not sure if my example of running the Sentence Extractor in extract-file mode contributes anything new to the discussion. Also, I noticed it to be really slow, like 2 hours to process 30 MiB of text. Next week, if time allows, I’m going to investigate this.

One more question, to be on the safe side with CC0 licensing. We’ve harvested the sentences from texts written by authors who died 70+ years ago, so the texts themselves are in the public domain. But the online library knihi.com, which digitized the texts and put them on the Web, suggests that use of their materials should be acknowledged. Do you think this requirement could potentially be a blocker, or would it be OK if we give proper credit to knihi.com in the pull request, e.g. in the commit message?

mkohler · June 26, 2021, 2:43pm

@heyhillary I’m redirecting this question to you

mytmpaccount2015 · June 27, 2021, 9:56pm

Meanwhile, we’ve done some additional filtering, leaving 69K sentences under consideration, and validated a 4K sample of them. The quality is high, with an error rate of around 2%, which is better than for the Wikipedia sentences. Here is the pull request.
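
As a rough illustration of the sample-based check described above, the following Python sketch draws a reproducible random sample and turns a count of rejected sentences into an error rate with an approximate 95% interval. It is not the procedure the team actually used: the function names, the fixed seed and the normal-approximation interval are all assumptions made for the example.

```python
import math
import random


def draw_sample(sentences, size, seed=0):
    """Return a reproducible random sample of at most `size` sentences."""
    rng = random.Random(seed)  # fixed seed so reviewers see the same sample
    return rng.sample(sentences, min(size, len(sentences)))


def error_rate_interval(rejected, reviewed, z=1.96):
    """Return (rate, lower, upper) using the normal approximation.

    `rejected` is the number of sentences reviewers marked as bad and
    `reviewed` is the sample size; z=1.96 corresponds to roughly 95%.
    """
    rate = rejected / reviewed
    margin = z * math.sqrt(rate * (1 - rate) / reviewed)
    return rate, max(0.0, rate - margin), min(1.0, rate + margin)
```

With a hypothetical count of 80 rejected sentences out of 4,000 reviewed (consistent with "around 2%", but not a figure given in the thread), `error_rate_interval(80, 4000)` yields a rate of 2% with an interval of roughly 1.6% to 2.4%.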

@heyhillary – could you please advise on the above, i.e. should we contact the library and clarify how exactly their contribution is to be credited, or maybe it’s not that important from the licensing perspective? Do you see any other potential issues regarding the CC0 status of our source texts?

@phire – just in case you happen to have time before Wednesday, June 30th, we would appreciate it if you could take a look at PR #3161. Although the chances are slight, we would be very happy if this bulk submission of Belarusian sentences could be part of the upcoming release.

I’d like to reiterate (if I may) that this is really important for the community: a volunteering campaign for Belarusian was announced a week ago by a major local media outlet, and currently several thousand people are donating their voices. We’re quickly running out of sentences, and while the activity is still high, each day counts.

Thanks again.

heyhillary · June 28, 2021, 9:17am

Hiya,

I will have to check with the Mozilla legal team to confirm whether you can use the library. Could you share the web page that details this?

Many thanks,

Hillary

mytmpaccount2015 · June 28, 2021, 10:03am

Sure, please find attached two files in a ZIP archive:
attachments.zip (23.2 KB)

public-domain-authors-70.tsv is the list of authors whose texts, published during their lifetime, we consider to be in the public domain. Links to knihi.com index pages were added manually for those authors whose texts are available.
knihi-com-source-text-links.txt is the list of links to the texts. In the source code of these pages, there are metadata provided by the library as HTML comments. E.g. for the first link these are the metadata:

<!-- HEADER_FIELD Authors: Адам Бабарэка -->
<!-- HEADER_FIELD CreationYear: 1922 -->
<!-- HEADER_FIELD Edition: невядомае -->
<!-- HEADER_FIELD FirstPublicationYear: 1923 -->
<!-- HEADER_FIELD Pravapis: A1957 -->
<!-- HEADER_FIELD PublicationYear: 1923? -->
<!-- HEADER_FIELD StyleGenre: мастацкі/апавяданне -->
<!-- HEADER_FIELD Title: Як красназорцы зямлі сцураліся -->
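
Metadata comments in this shape can be read with a few lines of Python. The sketch below assumes every field sits on its own line as `<!-- HEADER_FIELD Name: value -->`, exactly as in the excerpt above; the function name `parse_header_fields` is hypothetical, and the library may format other pages differently.

```python
import re

# One metadata comment per line: <!-- HEADER_FIELD Name: value -->
HEADER_FIELD = re.compile(r"<!--\s*HEADER_FIELD\s+(\w+):\s*(.*?)\s*-->")


def parse_header_fields(html: str) -> dict:
    """Collect HEADER_FIELD comments from page source into a dict.

    Values are kept as raw strings (e.g. "1923?"), because the library
    marks uncertain years with a question mark.
    """
    return {name: value for name, value in HEADER_FIELD.findall(html)}
```

For the first link, this would give `"1922"` for `CreationYear` and `"1923?"` for `PublicationYear`, which could then be checked against the list of authors.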

The site footer says:

That is:

Беларуская Палічка ‘Belarusian Bookshelf’ is the library name.

mytmpaccount2015 · June 28, 2021, 10:25am

Update: I’ve checked once again - we’re in fact using sentences from a subset of these texts, 673 out of 2097. Attaching the list of links once again.
knihi.com-source-text-links-subset.zip (8.4 KB)

heyhillary · June 28, 2021, 11:21am

Thank you for sharing the links!

heyhillary · June 30, 2021, 9:19am

Hiya @mytmpaccount2015

No worries. At the moment, legal counsel are on annual leave and will be back on 6th July, so there will be a delay in the response; sorry about this. Note that the next dataset release is at the end of July. Could you reach out to the publishers to clarify how they want to be recognised? I noticed you suggested possibly recognising them via the pull request.

I’ve linked below a similar discussion regarding:

mytmpaccount2015 · June 30, 2021, 7:18pm

Just for consistency, repeating here the message I posted a few hours ago in Matrix:
We got feedback from Andruś Žvir, the administrator of knihi.com. He is OK with the acknowledgement in the pull request. Hope this solves the issue.

mytmpaccount2015 · July 8, 2021, 11:27am

Hi @mkohler, sorry for bumping this thread once again – a couple of remarks related to the above discussion:

Now that the bulk submission of sentences from old Belarusian fiction has been merged, it should be safe to remove sentences by Kuźma Čorny (Кузьма Чорны) and Maksim Harecki (Максім Гарэцкі) from the sentence collector backlog. The bulk submission contains a much cleaner sample of sentences by these authors.
Regarding the performance of cv-sentence-extractor, I found the processing speed to be much higher with a release build, like this:
cargo run --release -- extract-file -l be -d /path/to/input/dir > /path/to/output/file
In case other people independently verify this, it might be a good idea to update the documentation accordingly.

mkohler · July 8, 2021, 4:56pm

Will do.

That does sound reasonable. Would you like to create a PR for this?

mytmpaccount2015 · July 9, 2021, 12:30pm

Done. Please note that I’ve edited only the section on extract-file mode, as I don’t have any performance comparison for the other two modes.

mkohler · July 9, 2021, 7:38pm

Thank you. For the other two it’s substantial as well, so I went ahead and added it to every command in the README, as well as to the scripts that run in the pipeline. I can’t believe I didn’t notice this earlier; it would have saved many hours. Great work!
