Polish sentences concerns

I forked the git repo provided by @Scarfmonster and tried the scraping. It did pretty well for a first try. I removed ( and ) from the allowed symbols, as many sentences seemed broken by them (e.g. some detail in the middle of a sentence enclosed in parentheses). I will run the blacklist operations overnight and will put the word statistics in the forked repo if the file is of a reasonable size. @Scarfmonster, should we make a separate topic for work on the Polish extractor settings? I also noticed this topic, for future reference: Using the Europarl Dataset with sentences from speeches from the European Parliament - Common Voice - Mozilla Discourse.
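
To illustrate what I mean by dropping sentences with parentheses and collecting word statistics, here is a rough Python sketch. It is not the extractor's actual code or rule format; the names `DISALLOWED`, `keep_sentence` and `word_counts` and the sample sentences are made up for the example.

```python
from collections import Counter

# Hypothetical set of rejected symbols; the real extractor settings are not shown here.
DISALLOWED = set("()")


def keep_sentence(sentence: str) -> bool:
    """Return True when the sentence contains none of the disallowed symbols."""
    return not any(ch in DISALLOWED for ch in sentence)


def word_counts(sentences):
    """Count lowercased words across sentences, ignoring surrounding punctuation."""
    counts = Counter()
    for sentence in sentences:
        for word in sentence.split():
            word = word.strip(".,;:!?\"'").lower()
            if word:
                counts[word] += 1
    return counts


# Made-up sample sentences, only for illustration.
sample = [
    "To jest proste zdanie.",
    "Miasto (stolica regionu) leży nad rzeką.",
    "To zdanie jest krótkie.",
]

kept = [s for s in sample if keep_sentence(s)]
stats = word_counts(kept)

# Rare words are the usual candidates for a blacklist review.
for word, count in stats.most_common():
    print(word, count)
```

Rejecting the whole sentence is simpler than trying to cut out the parenthesised part, which tends to leave broken sentences behind.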