We have finally prepared a new Basque sentence set thanks to the scraper. It contains 55,031 new sentences.
To create the blacklist we used a threshold of fewer than 20 repetitions. Basque is an agglutinative language that uses a lot of suffixes (so there are many similar words with a low repetition rate), and higher thresholds only reduced the size of the result without improving its quality. We got 110,000 de-duplicated sentences, then reduced that set using many regular expressions (foreign words, wrong characters/words…) and finally cleaned and fixed it manually (spelling errors, agreement, more foreign words…). Five people did this work. Here is the pull request for Basque in the Scraper project: https://github.com/Common-Voice/common-voice-wiki-scraper/pull/95
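For anyone who wants to reproduce the idea for another language, here is a minimal sketch of the filtering described above. It is not the scraper's actual code: the function names `build_blacklist` and `filter_sentences`, the word regex and the `reject_patterns` argument are all assumptions made for illustration. It counts word frequencies, blacklists words seen fewer than 20 times, and then drops duplicates, sentences containing a blacklisted word, and sentences matching any rejection regex.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def build_blacklist(sentences, min_count=20):
    """Return the set of words that occur fewer than min_count times."""
    counts = Counter(
        word.lower() for s in sentences for word in WORD_RE.findall(s)
    )
    return {word for word, n in counts.items() if n < min_count}


def filter_sentences(sentences, blacklist, reject_patterns=()):
    """De-duplicate, then drop sentences with rare words or rejected patterns."""
    seen = set()
    kept = []
    for s in sentences:
        s = s.strip()
        if not s or s in seen:
            continue  # skip empty lines and exact duplicates
        seen.add(s)
        words = {w.lower() for w in WORD_RE.findall(s)}
        if words & blacklist:
            continue  # contains at least one low-frequency word
        if any(p.search(s) for p in reject_patterns):
            continue  # matches a rule for foreign words, bad characters, etc.
        kept.append(s)
    return kept
```

The threshold is a parameter because, as noted above, the right value depends on the language: for a language with many suffixed word forms, a high threshold throws away a lot of valid sentences.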
Finally, a sixth person (a Basque language teacher) reviewed 550 random sentences from the previous result and found errors in 2% of them. Some of these errors, such as a few missing commas, are not very important for the aim of this project.
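The spot check above can be sketched as follows. This is only an illustration, not the procedure we actually ran: the helper names `draw_review_sample` and `wilson_interval` are hypothetical, and the figure of 11 erroneous sentences is an assumption derived from 2% of 550. The Wilson interval is a standard way to put a rough range around an error rate estimated from a sample.

```python
import math
import random


def draw_review_sample(sentences, k=550, seed=0):
    """Pick k sentences at random (reproducibly) for manual review."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(k, len(sentences)))


def wilson_interval(errors, n, z=1.96):
    """Approximate 95% Wilson score interval for an error rate of errors/n."""
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Assumption: 2% of 550 reviewed sentences is about 11 sentences with errors.
low, high = wilson_interval(11, 550)
print(f"Estimated error rate: 2% (plausible range {low:.1%} to {high:.1%})")
```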
Here is the result: https://librezale.eus/mediawiki//images/2/2b/CommonVoicerako-esaldiak3-Wikipedia.txt
What is the next step we should take? Can you load these sentences without making us validate them again, five at a time, in the Sentence Validator? (We have already done that work carefully!)