Sorry for reviving a year-old topic, but I wanted to thank you all for your input.
It took me about 6-7 weeks to correctly form a blacklist and fine-tune the rules.
To implement what @drzraf suggested, I had to write a bunch of Python scripts. I mainly used the “white-list before black-list” approach and ran it iteratively 100+ times. As a side note: stemmers and lemmatizers also produce roughly 15-20% wrong results, so I had to re-clean those outputs again and again.
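For anyone curious, the core of that “white-list before black-list” pass looked roughly like this. This is only a minimal sketch; the file names and the `keep_sentence` helper are made up for illustration, not my actual scripts:

```python
# Sketch of the "white-list before black-list" pass (illustrative file names).

def load_wordlist(path):
    """Load one lowercase word per line into a set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

whitelist = load_wordlist("whitelist.txt")   # known-good Turkish words
blacklist = load_wordlist("blacklist.txt")   # known-bad tokens (typos, foreign words, ...)

def keep_sentence(sentence):
    words = [w.lower() for w in sentence.split()]
    # White-list first: every word must be explicitly approved...
    if not all(w in whitelist for w in words):
        return False
    # ...then black-list: reject anything that still slipped onto the bad list.
    return not any(w in blacklist for w in words)

with open("wiki_sentences.txt", encoding="utf-8") as src, \
     open("filtered_sentences.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if keep_sentence(line.strip()):
            dst.write(line)
```

(One caveat for Turkish: plain `str.lower()` does not handle the İ/ı casing correctly, so a real script needs locale-aware lowercasing.)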
It resulted in only 347k sentences (I was expecting 1.5M), unfortunately. It seems most of the “articles” in the Turkish Wikipedia are not proper articles, just lists of articles, places, and such. The main article groups are Islam, sports, medicine, animals & plants, history, Turkic states, cities, and even small towns… There were too many Arabic- and Latin-based words (mostly misspelled), way too many typos, and “invented” words that seem to come from poor language knowledge.
The random/non-deterministic selection forced me to work on the whole set of possible sentences (3 words or more, with a certain minimum sentence length). And due to the agglutinative nature of Turkish, I could not strip out low-frequency words, as most of the correct words have a frequency of 1.
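To show why frequency-based pruning fails here, a toy sketch of the frequency check (the file name and the 20-character minimum length are just illustrative values, not my actual thresholds):

```python
# Count surface-form frequencies and see how much of the vocabulary would
# vanish if frequency-1 words were stripped.
from collections import Counter

freq = Counter()
with open("filtered_sentences.txt", encoding="utf-8") as f:
    for line in f:
        words = line.strip().split()
        # only keep sentences with 3+ words and a minimum character length
        if len(words) >= 3 and len(line.strip()) >= 20:
            freq.update(w.lower() for w in words)

singletons = sum(1 for count in freq.values() if count == 1)
print(f"{singletons}/{len(freq)} surface forms occur exactly once")
```

For an agglutinative language like Turkish that ratio is huge, because each stem shows up under many different suffix combinations, so dropping frequency-1 forms would also throw away lots of perfectly correct vocabulary.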
I could not get the error rate below 1%; our rather extensive testing puts it between 2% and 3%. This is due to many sentences with bad grammar caused by bad translations, bad punctuation, incomplete sentences, many Google-translated articles where the sentences make no sense, badly borrowed words from other languages, words unknown to the testers (because I white-listed them to get more domains), etc.
One thing I did was to allow words from multiple domains, such as medicine, as Common Voice does not look likely to include a domain-specific corpora feature in the foreseeable future.
So the bulk-added-sentences issue which @ftyers pointed out became less important. It will take a couple of years to work through them at our current speed, and I’ll be adding sentences from other sources as well to dilute these…
I documented the process rather extensively in the PR, for people who will work on cv-sentence-extractor in the future.
Thank you again @mkohler, @drzraf, and @ftyers… It was a pain