Let's have a separate thread for coordinating our efforts to provide input for the wiki scraper. I have already seen some activity along these lines here: https://discourse.mozilla.org/t/polish-sentences-concerns/52136/15
and here: https://discourse.mozilla.org/t/polish-dataset-download/52707 so we can probably speed up the work by doing it together ;).
Before going further, please visit https://github.com/Common-Voice/common-voice-wiki-scraper and https://discourse.mozilla.org/t/technical-feedback-needed-wikipedia-extractor-script-beta/42983/ if you haven’t already.
As far as I know, the current state is this: @Scarfmonster started some work in his repository, which I have forked and extended with a word usage file to help prepare the blacklist (see https://github.com/J-Wrobel/common-voice-wiki-scraper/tree/polish).
I am now running an additional script that checks how many hits each of these words gets in the Polish language corpus (https://sjp.pwn.pl/korpus) and writes the results to a second file, so we can use two sources when creating the blacklist.
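To make the two-source idea concrete, here is a minimal sketch in Python. It is not the script from either repository. The function names, the thresholds, and the assumption that the corpus hits are available as a plain word-to-count mapping are all hypothetical and only for illustration.

```python
import re
from collections import Counter

# Matches runs of letters only (including Polish diacritics); digits and "_" are excluded.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(sentences):
    """Count how often each lowercased word appears in the extracted sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(word.lower() for word in WORD_RE.findall(sentence))
    return counts


def build_blacklist(wiki_counts, corpus_hits, min_wiki=2, min_corpus=1):
    """Return words that are rare in BOTH sources (hypothetical thresholds).

    wiki_counts: word -> number of occurrences in the extracted sentences
    corpus_hits: word -> number of hits in the reference corpus (assumed format)
    """
    return sorted(
        word
        for word, n in wiki_counts.items()
        if n < min_wiki and corpus_hits.get(word, 0) < min_corpus
    )


if __name__ == "__main__":
    sentences = ["Ala ma kota", "Ala ma psa", "Kot xyzzy"]
    hits = {"kota": 100, "psa": 50, "kot": 200}  # made-up numbers for the demo
    print(build_blacklist(count_words(sentences), hits))
```

Requiring a word to be rare in both sources keeps legitimate but infrequent words off the blacklist; the thresholds would need tuning against real data.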
I think the simplest plan is to create the blacklist file and update the rules with whatever we find useful, then create a shuffled sample of sentences for initial approval (preferably by Polish linguists, who could give an error percentage and remarks for improvement).
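For the review sample, something along these lines could work. This is only a sketch: the function names, the sample size, and the fixed seed are my assumptions, not anything the scraper provides.

```python
import random


def sample_sentences(sentences, k=500, seed=42):
    """Return up to k unique sentences in random order.

    A fixed seed (hypothetical default) makes the sample reproducible,
    so every reviewer can regenerate the same list.
    """
    unique = sorted({s.strip() for s in sentences if s.strip()})
    rng = random.Random(seed)
    return rng.sample(unique, min(k, len(unique)))


def error_percentage(num_rejected, num_reviewed):
    """Share of reviewed sentences that were rejected, in percent."""
    if num_reviewed == 0:
        raise ValueError("no sentences were reviewed")
    return 100.0 * num_rejected / num_reviewed


if __name__ == "__main__":
    demo = ["Ala ma kota.", "Ala ma kota.", "  ", "To jest zdanie.", "Pada deszcz."]
    for line in sample_sentences(demo, k=2):
        print(line)
```

Deduplicating and sorting before sampling means the result depends only on the set of sentences and the seed, not on the order in which the scraper emitted them.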
Any comments and ideas are welcome. Let’s do it!