As far as I know, the current state is this: @Scarfmonster started some work in his git repository, which I have forked, adding a word usage file to help prepare a blacklist (see https://github.com/J-Wrobel/common-voice-wiki-scraper/tree/polish).
I am now running an additional script that checks the number of hits for each of these words in the Polish language corpus (https://sjp.pwn.pl/korpus) and writes the results to a separate file, so we can use two sources when creating the blacklist.
I think the simplest plan is to create the blacklist file and update the rules with whatever we find useful, then create a shuffled sample of sentences for initial approval (preferably by Polish linguists, who could give an error percentage and remarks for improvement).
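To make the "shuffled sample" step concrete, here is a minimal sketch of how such a sample could be drawn. It is not part of the scraper; the function name `draw_review_sample` and its parameters are hypothetical, and it assumes the extracted sentences are already available as a list of strings.

```python
import random


def draw_review_sample(sentences, sample_size, seed=0):
    """Return up to `sample_size` distinct sentences in random order.

    Hypothetical helper for illustration only, not part of the scraper.
    """
    # Drop exact duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(sentences))
    # A seeded generator makes the sample reproducible between runs.
    rng = random.Random(seed)
    # random.sample returns the chosen items in random order.
    return rng.sample(unique, min(sample_size, len(unique)))
```

Fixing the seed means every reviewer can regenerate exactly the same sample from the same input.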
Sorry for being away from this topic for some time. I have prepared blacklist files based on word usage frequency and on the number of hits for each word in the national corpus. I wanted to generate a sample of 200-400 sentences to check the error rate, but I noticed that @Scarfmonster filed an issue in rust-punkt (see https://github.com/ferristseng/rust-punkt/issues/16), which is probably affecting the scraper. I am not sure what the best way to proceed is. Any thoughts? I am not really familiar with NLP, and I doubt I will have enough time to get up to speed and help with a fix.
In this case I think it would be better to wait for the fix before actually evaluating the sample. I can try to refine the rules while waiting. I would like to avoid being affected by sentence segmentation errors. What do you think? Do you have any information about when the fix might be available?
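For anyone curious how two sources can be combined into one blacklist, here is a rough sketch. It is not the actual script: the function name, the thresholds, and the rule "blacklist a word only if it is rare in the scraped text and also has too few corpus hits" are all assumptions made for illustration.

```python
def build_blacklist(wiki_counts, corpus_hits, min_wiki_count=2, min_corpus_hits=1):
    """Return a sorted list of words to blacklist.

    wiki_counts: word -> number of occurrences in the scraped text.
    corpus_hits: word -> number of hits in the reference corpus.
    Both thresholds are hypothetical and would need tuning.
    """
    blacklist = set()
    for word, count in wiki_counts.items():
        # Words missing from the corpus file are treated as having zero hits.
        hits = corpus_hits.get(word, 0)
        # Assumed rule: both sources must agree that the word is rare.
        if count < min_wiki_count and hits < min_corpus_hits:
            blacklist.add(word)
    return sorted(blacklist)


# Made-up counts, for illustration only.
example = build_blacklist(
    {"dom": 120, "kot": 1, "xyzzy": 1},
    {"dom": 5000, "kot": 300},
)
```

Requiring both conditions is the conservative choice: a word that is rare on Wikipedia but common in the corpus (like "kot" in the made-up data) stays allowed. Switching `and` to `or` would give a stricter blacklist.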
Whew, I managed to find some time and put together an initial set of Polish rules for the wiki-scraper. Adding the possibility to use Python segmenters was a great upgrade - thanks CV team! You can find it here link. If anyone could have a look, review the sentences, and help improve the rules, that would be great! @kam193, @Scarfm maybe :)?
Sure! I’ll try to review it soon. I see that in the review sheet, the columns for Reviewers 2 and 3 have all sentences marked OK - does that mean someone has reviewed them, or (as I suppose) is this just the default value?
It's pre-filled with OK by default, which speeds up the review a bit. You can put your nickname in the Reviewer 2 column. I will put mine in the Reviewer 1 column later to tell them apart.
OK, I’ve done my review. I got a 94% rate; most of the issues were just due to foreign names. I think that’s pretty good, though I don’t remember what the minimum rate is. The sentences looked mostly high quality, so in my opinion it works well.
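In case it helps other reviewers, the rate can be computed from a reviewer column roughly like this. This is only a sketch: the function name is hypothetical, and it assumes each cell holds "OK" for an accepted sentence and anything else for a rejected one.

```python
def approval_rate(verdicts):
    """Return the percentage of verdicts equal to "OK" (case-insensitive)."""
    if not verdicts:
        raise ValueError("no verdicts to evaluate")
    # Count cells that read "OK" after trimming whitespace.
    ok_count = sum(1 for verdict in verdicts if verdict.strip().upper() == "OK")
    return 100.0 * ok_count / len(verdicts)


# Illustrative column only, not the real review sheet.
rate = approval_rate(["OK"] * 94 + ["foreign name"] * 6)
```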