Here’s a snippet on how we got CC-0 text for Kyrgyz. Full article
TL;DR - We asked a news publication to donate their text.
‘’’
Currently, all Kyrgyz text sentences used for this project come from the well-known Kyrgyz-language news source Kloop.kg. The founder of Kloop.kg, Bektour Iskender, a proponent of the open internet and Creative Commons, allowed Kyrgyz-language articles from Kloop to be distributed under CC-0. As such, when a user reads a sentence for Kyrgyz Common Voice, they are actually reading news from Kloop.kg. This is a major win for the Kyrgyz language and the open internet, because finding CC-0 text for Common Voice is typically the most difficult task in adding a new language: at least 5,000 different sentences should be initially recorded, and most books and online news (such as BBC Kyrgyz) are not available under CC-0.
After the text was automatically downloaded from Kloop (via this Python script), it was cleaned (all foreign words, numbers, and abbreviations were removed) and sentences of an appropriate length were selected. Ideally, each recording should be about 5 seconds long. More text can be added later, so that there is more diversity in the kinds of sentences read. Diversity is important for Common Voice, because good speech technologies should recognize the speech of people speaking with different accents about different topics.
‘’’
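For anyone wanting to try the same cleaning step on their own donated text, here is a rough sketch of what it might look like. This is not the script linked in the article, just my guess at the idea. Assumptions on my part: a sentence containing a foreign (non-Cyrillic) word, a number, or an all-caps abbreviation is dropped entirely rather than edited; word count stands in for the "about 5 seconds" target; the names `select_sentences`, `MIN_WORDS` and `MAX_WORDS` and the 3 to 14 word bounds are made up for illustration; and the sample sentences are invented, not taken from Kloop.

```python
import re

# Kyrgyz Cyrillic: the Russian alphabet plus the letters ң, ө, ү.
LETTERS = "А-Яа-яЁёҢңӨөҮү"
# A word is one or more Kyrgyz letters, optionally joined by hyphens.
WORD = re.compile(rf"[{LETTERS}]+(?:-[{LETTERS}]+)*")
# Punctuation stripped from the edges of each token before checking it.
PUNCT = ".,!?:;\"'«»()-\u2013\u2014"

# Hypothetical bounds: a crude stand-in for "about 5 seconds" of speech.
MIN_WORDS, MAX_WORDS = 3, 14


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_clean(sentence):
    """True if the sentence has an acceptable length and only Kyrgyz words."""
    tokens = [t.strip(PUNCT) for t in sentence.split()]
    tokens = [t for t in tokens if t]
    if not MIN_WORDS <= len(tokens) <= MAX_WORDS:
        return False
    for token in tokens:
        # Rejects digits, Latin-script (foreign) words and stray symbols.
        if not WORD.fullmatch(token):
            return False
        # Treats multi-letter all-caps tokens as abbreviations.
        if len(token) > 1 and token.isupper():
            return False
    return True


def select_sentences(text):
    """Return the clean sentences in order, without duplicates."""
    seen = set()
    selected = []
    for sentence in split_sentences(text):
        if is_clean(sentence) and sentence not in seen:
            seen.add(sentence)
            selected.append(sentence)
    return selected


if __name__ == "__main__":
    sample = (
        "Бүгүн аба ырайы жакшы болот. "
        "Бишкекте 25 градус жылуу болду. "
        "Ал Facebook баракчасына жазды. "
        "КР өкмөтү жаңы чечим кабыл алды. "
        "Ооба."
    )
    for sentence in select_sentences(sample):
        print(sentence)
```

With the invented sample, only the first sentence survives: the others are dropped for containing a number, a Latin-script word, an abbreviation, or for being too short. A real pipeline would want a better sentence splitter and a length limit tuned against actual recordings.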