OK, I’ll try to explain myself better
Using the scrapper with the blacklist and some regular expressions for abbreviations and some more words, we soon saw that this way we weren’t able to achieve an acceptable result for Basque. After filtering strange characters, etc. about 50 % of the sentences were problematic, no matter too much which repetition criteria we choose. That’s because Basque Wikipedia isn’t as big as other ones, so many articles are quite short and full of people and place names, and because Basque language itself is composed of many longs words with low repetition rates: verb declension, re-declension, composed words, prefixes, a pile of suffixes… The problem was that many foreign words in the sentences didn’t follow Basque language rules and will ruin the phonetic models so we needed to remove them (we have kept the sentences with foreign words compatible with Basque).
We needed to divide the work between the volunteers and most of them didn’t know regular expressions, so I took a text editor and made a first clean using about 200-300 searches using regular expressions of just words or parts of words used in problematic foreign words. Remember that Basque uses a lot of prefixes and suffixes, so creating perfect regular expressions for all the cases would be a nightmare. This way, I just decided on the go if each sentence should be removed or not, and many potential regular expressions went disappearing as the wrong sentences where less and less. Thanks to this iterative process, I removed many sentences in a short time (but I didn’t use a list of perfect regular expressions that can be reused).
Then, other volunteers continued the work, fixing sentences with spell problems, removing of changing sentences with problematic foreign words, etc. Finally, from a collection of 110.000 de-duplicated sentences gave by the scrapper, we got 55.000 validated sentences. Did I explained myself now? Do you understand how after a horrible beginning (about 50% wrong), we got a quite good ending (about 1-2% wrong)?
I don’t know if the steps we took were the best, the optimal, etc. I just know that Basque Wikipedia is quite little and we got many new sentences for Common Voice project.