I have just remembered that, after generating the blacklist and before creating the scraper sentence list, we used some regular expressions to reduce the automatic blacklist, because it contained more valid Basque words than invalid ones. The reason, as I explained, is the agglutinative nature of the language, which produces a lot of word forms that are each repeated only a few times.
So I think a “whitelist” feature would be useful for other languages with properties similar to Basque that are interested in using the Wikipedia Scraper: that is, the possibility to define a list of regular expressions that prevent certain kinds of words from being included in the blacklist. For example, many suffixes cause a lot of Basque words to end up in the blacklist, and the same regular expressions I used for the manual cleanup could be put in a configuration file: *gatik, *ganako, *rentzako, *rentzat, *rekin, *renganako… I used many of them, and some I had to check manually because they could give false positives: *ren, *ri, *ra… Obviously, these last ones can’t be included in this hypothetical parameter.
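To make the idea more concrete, here is a rough sketch in Python of how such a whitelist could be applied to the generated blacklist. Everything in it is an assumption for illustration: the names `WHITELIST_PATTERNS`, `compile_whitelist` and `filter_blacklist` are hypothetical, the `*suffix` notation is just the one I used above, and none of this is an existing option of the scraper.

```python
import re

# Hypothetical whitelist, written in the "*suffix" notation used above.
# In a real setup this list would be read from a configuration file.
WHITELIST_PATTERNS = ["*gatik", "*ganako", "*rentzako",
                      "*rentzat", "*rekin", "*renganako"]


def compile_whitelist(patterns):
    """Turn glob-like patterns into one compiled regular expression."""
    parts = []
    for pattern in patterns:
        # Escape the literal text, then let '*' stand for one or more
        # word characters (assumption: a bare suffix is not a word).
        parts.append(re.escape(pattern).replace("\\*", "\\w+"))
    return re.compile("^(?:" + "|".join(parts) + ")$", re.IGNORECASE)


def filter_blacklist(blacklist, patterns):
    """Return the blacklist without the words matched by the whitelist."""
    whitelist_re = compile_whitelist(patterns)
    return [word for word in blacklist if not whitelist_re.match(word)]


if __name__ == "__main__":
    # Toy blacklist: three suffixed Basque forms and one nonsense token.
    candidates = ["lagunarengatik", "etxearekin", "amarentzat", "xyzzyq"]
    print(filter_blacklist(candidates, WHITELIST_PATTERNS))
```

Short suffixes such as *ren, *ri or *ra are left out of the list on purpose, since they would also rescue words that really should stay in the blacklist.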
If you think this would be interesting for languages other than Basque, I can create an issue in the GitHub project so that other people can benefit from it.