Coordination of input for the Polish-language wiki scraper

Hello,
let's have a separate thread for coordinating our efforts to provide input for the wiki scraper. I have already seen some activity on this front here: https://discourse.mozilla.org/t/polish-sentences-concerns/52136/15
and here: https://discourse.mozilla.org/t/polish-dataset-download/52707 so we can probably speed up the work by doing it together ;).

Before going further, please visit https://github.com/Common-Voice/common-voice-wiki-scraper and https://discourse.mozilla.org/t/technical-feedback-needed-wikipedia-extractor-script-beta/42983/ if you haven't already.

As far as I know, the current state is: @Scarfmonster started some activity in his Git repository, which I have forked, and I added a word-usage file to prepare a blacklist (see here: https://github.com/J-Wrobel/common-voice-wiki-scraper/tree/polish).
I am now running an additional script that checks each of these words for its number of hits in a Polish-language corpus (https://sjp.pwn.pl/korpus) and creates an additional file, so we can use two sources when building the blacklist.
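The two-source idea could be sketched roughly like this. To be clear, the function name, thresholds, and the assumption that both sources are plain word-to-count mappings are all mine, not what is actually in the fork:

```python
# Sketch: flag words that are rare in BOTH sources as blacklist candidates.
# Thresholds and the dict-based input format are assumptions for illustration.

def blacklist_candidates(usage_counts, corpus_hits, max_usage=5, max_hits=10):
    """Return words rare in both the word-usage file and the national
    corpus; such words are likely misspellings, foreign terms, or other
    blacklist candidates."""
    candidates = set()
    for word, count in usage_counts.items():
        # A word missing from the corpus counts as 0 hits.
        if count <= max_usage and corpus_hits.get(word, 0) <= max_hits:
            candidates.add(word)
    return candidates
```

Requiring a word to be rare in both sources keeps common words that just happen to be missing from one list out of the blacklist.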

I think the simplest plan is to create a blacklist file and update the rules with whatever we find useful. Then we can create a shuffled sample of sentences for initial review (preferably by Polish linguists, who could give an error percentage and remarks for improvement).
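For the shuffled sample step, a minimal sketch could look like this (the sample size and the fixed seed are arbitrary choices of mine, not anything agreed in this thread):

```python
import random

def shuffled_sample(sentences, n=400, seed=42):
    """Pick up to n random sentences for manual review.

    A fixed seed keeps the sample reproducible, so reviewers can be
    pointed at the same set of sentences."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(n, len(sentences)))
```

The reviewers would then only need the resulting text file, one sentence per line, to estimate the error rate.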

Any comments and ideas welcome and let’s do it!

Great to see this effort. Happy to help on a technical level if you need it.

Sorry for being away from the topic for some time. I have prepared blacklist files based on word-usage frequency and the number of hits each word gets in the national corpus. Next, I wanted to generate a sample of 200-400 sentences to check the error rate, but I noticed that @Scarfmonster filed an issue in rust-punkt (see here: https://github.com/ferristseng/rust-punkt/issues/16) which probably affects the scraper. I am not sure what the best way to proceed is. Any thoughts? I am not really familiar with NLP, and I doubt I will have much time to get acquainted with it enough to help with a fix.

Yes, we have the same issue in the scraper :frowning:

In that case, I think it would be better to wait for the fix before actually evaluating the sample. I can try to refine the rules further while waiting. I would like to avoid being misled by sentence-segmentation errors. What do you think? Do you have any information on when a fix might land?

I wouldn't block on that. Given that rust-punkt is by now basically unmaintained, I wouldn't keep my hopes up. There is https://github.com/Common-Voice/cv-sentence-extractor/issues/11 to look into alternatives, though.