We’d like to ask for your advice on collecting additional sentences for Belarusian. Since @Aliaksandr finalized the pull request and a local team of enthusiasts launched a volunteering campaign, the number of clips in Belarusian has been increasing at a steady rate of >10K per day. In a few days, each sentence in the Belarusian Wikipedia extraction dump (85K sentences) will likely have been recorded at least once, assuming the least-recorded sentences come first in the queue. Exhausting the supply of sentences isn’t a problem in itself, since the robustness of an ASR system improves when the training data contains many recordings per sentence. However, we’re concerned about lexical and grammatical diversity: the Wikipedia data don’t cover a range of important phenomena, such as interrogative and imperative sentences, colloquialisms, etc. In addition, the volunteers may get bored after a while if the dataset is not expanded.
Let me briefly describe the current situation with Belarusian sentence collection:
- Apart from Wikipedia, we’re not aware of any other sources of CC0-licensed Belarusian texts that would be large enough for bulk import via the sentence extractor.
- There is some work in progress on importing sentences from the media portal Euroradio, but the legal agreement ensuring CC0 will be ready no earlier than July 6th (details here).
- In the sentence collector, there are ~18K Belarusian sentences that haven’t yet been validated, mostly from fiction books written in the first half of the 20th century (and therefore in the public domain). Many of them are noisy: there are OCR errors (such as Latin “i” instead of Belarusian “і”), sentence splitting issues, unusual proper names, words no longer used in modern standard Belarusian, etc. Reviewing these sentences manually wouldn’t be particularly effective, as most of them would be downvoted.
- We can quickly prepare a cleaner sample of sentences from old fiction books in Belarusian available at knihi.com.
- Do you think we should focus on reviewing the sentences which are currently in the sentence collector, or making a cleaner sample?
- If the former: Is it at least possible to replace Latin “i” (U+0069 lowercase, U+0049 uppercase) with Belarusian “і” (U+0456, U+0406) everywhere in the sentences for review?
- If the latter: Should we still upload the new sample into the sentence collector and then review the sentences one by one? Or should we follow the bulk upload procedure described here, i.e. review a subset of sentences and then send a PR?
- Is my understanding correct that the next export from the sentence collector is scheduled for June 30th?
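
To make the character replacement asked about above concrete, here is a minimal sketch of what we have in mind. It is only an illustration on plain strings: the function name `fix_latin_i` is our own, and we are not assuming anything about how the sentence collector stores or exposes its data.

```python
# Hypothetical helper, not part of the sentence collector.
# Maps Latin "i" (U+0069) and "I" (U+0049) to the Belarusian
# Cyrillic "і" (U+0456) and "І" (U+0406).
LATIN_TO_CYRILLIC_I = str.maketrans({
    "\u0069": "\u0456",  # lowercase
    "\u0049": "\u0406",  # uppercase
})


def fix_latin_i(sentence: str) -> str:
    """Replace every Latin i/I in the sentence with its Cyrillic look-alike."""
    return sentence.translate(LATIN_TO_CYRILLIC_I)


if __name__ == "__main__":
    noisy = "\u041c\u0069\u043d\u0441\u043a"  # "Мiнск" with a Latin "i"
    print(fix_latin_i(noisy))
```

Note that this replaces the letter everywhere, including inside any genuinely Latin-script words, which should be acceptable here since such sentences would be rejected during review anyway.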
Thanks in advance for any comments. We’re really interested in adding more Belarusian sentences asap, and we would appreciate your guidance.