From my point of view, the issue is simply that the scraper doesn’t really fit languages like Basque. I don’t know German, but as far as I know, suffixes are common there too, so perhaps German speakers can understand the problem I’m trying to explain better, or can explain it to speakers of other languages better than I can. I think I’m not explaining myself clearly enough because the subject is complex and abstract.
I’m a developer working in R&D, and I don’t see a way to address Basque language needs programmatically and still get a good Wikipedia scraper. I think trying to find hundreds of regular expressions for the blacklist and for the resulting sentence list is the wrong approach. Many of those regular expressions would be extremely difficult to define, and many would only affect a handful of sentences, sometimes just one, as the sketch below illustrates. That’s why I think applying ONLY a programmatic strategy in all cases will discriminate against the Basque language (and probably other languages, if they try to use the scraper to do part of the work).
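To make the suffix problem concrete, here is a minimal, purely illustrative Python sketch. The stem and the suffix list are mine, not taken from the scraper or its config: the point is only that one blacklisted foreign stem can surface in many inflected forms, so a pattern written for the bare stem misses most of them, and widening the pattern starts to catch unrelated words.

```python
import re

# Hypothetical sketch (not the real scraper config): in an agglutinative
# language, one foreign stem shows up with many case endings, so a single
# blacklist pattern has to anticipate all of them.
stem = "Dostoievski"
# A small, illustrative subset of Basque case endings the name can take.
suffixes = ["", "k", "ri", "ren", "rekin", "rentzat", "renganako"]
surface_forms = [stem + s for s in suffixes]

# A blacklist pattern written for the bare stem only catches one form;
# every suffixed form slips through unless the pattern is widened.
pattern = re.compile(r"\bDostoievski\b")
caught = [form for form in surface_forms if pattern.search(form)]
print(caught)                                        # ['Dostoievski']
print(len(surface_forms) - len(caught), "inflected forms missed")
```

And that is just one name with a few case endings; multiply it by every problematic word and every ending, and the blacklist explodes.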
If you need to “run the extraction from your side”, the only solution I see is that you execute the scraper with the config included in our pull request and then match the lines of your result against the lines of ours. The lines that are identical on both sides are good sentences, and you can be sure they come from the Basque Wikipedia. But this way we would lose all the hours dedicated to corrections of spelling, typos, orthography, grammar, capitalisation, commas, etc., as well as the very occasional substitutions we made of problematic foreign words with Basque ones (I don’t remember any right now, but perhaps something like this: Dostoevsky --> Etxeberria).
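The matching step itself is trivial; here is a minimal sketch of what I mean by matching the lines (the file names are placeholders, not the actual export paths):

```python
# Sketch of the line-matching idea described above; file names are
# placeholders, not the actual export paths.
with open("your_extraction.txt", encoding="utf-8") as f:
    yours = set(line.strip() for line in f)

with open("our_submission.txt", encoding="utf-8") as f:
    ours = [line.strip() for line in f]

# Sentences present in both files are verifiably extracted from the
# Basque Wikipedia; everything else in our list came from manual work
# (spelling, grammar, capitalisation fixes) and would be dropped.
verified = [sentence for sentence in ours if sentence in yours]
only_manual = [sentence for sentence in ours if sentence not in yours]

print(len(verified), "sentences verifiable against the raw extraction")
print(len(only_manual), "manually corrected sentences that would be lost")
```

Any sentence we corrected by hand would of course no longer be an exact match, which is exactly why the manual work would be lost in this scheme.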
The people behind the Basque sentence compilation (basically me) can’t repeat the work done over the last months, and I see the programmatic approach as a dead end, so right now I don’t see any other option to save at least part of that work and still get all the new sentences that Basque voice recordings need.