Coordination of input for the Polish-language wiki scraper

Hello,
let's have a separate thread for coordinating our efforts to provide input for the wiki scraper. I have already seen some activity on this front here: https://discourse.mozilla.org/t/polish-sentences-concerns/52136/15
and here: https://discourse.mozilla.org/t/polish-dataset-download/52707 so we can probably speed up the work by doing it together ;).

Before going further, please visit https://github.com/Common-Voice/common-voice-wiki-scraper and https://discourse.mozilla.org/t/technical-feedback-needed-wikipedia-extractor-script-beta/42983/ if you haven't already.

As far as I know, the current state is: @Scarfmonster started some activity in his Git repository, which I have forked, and I added a word-usage file to prepare a blacklist (see here: https://github.com/J-Wrobel/common-voice-wiki-scraper/tree/polish).
I am now running an additional script that checks each of these words for its number of hits in a Polish-language corpus (https://sjp.pwn.pl/korpus) and creates an additional file, so we can use two sources when building the blacklist.
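The two-source idea could be sketched roughly like this. To be clear, the function name, thresholds, and the assumption that both sources are plain word-to-count mappings are all mine, not what is actually in the fork:

```python
# Sketch: flag words that are rare in BOTH sources as blacklist candidates.
# Thresholds and the dict-based input format are assumptions for illustration.

def blacklist_candidates(usage_counts, corpus_hits, max_usage=5, max_hits=10):
    """Return words rare in both the word-usage file and the national
    corpus; such words are likely misspellings, foreign terms, or other
    blacklist candidates."""
    candidates = set()
    for word, count in usage_counts.items():
        # A word missing from the corpus counts as 0 hits.
        if count <= max_usage and corpus_hits.get(word, 0) <= max_hits:
            candidates.add(word)
    return candidates
```

Requiring a word to be rare in both sources keeps common words that just happen to be missing from one list out of the blacklist.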

I think the simplest plan is to create a blacklist file and update the rules with whatever we find useful. Then we can create a shuffled sample of sentences for initial review (preferably by Polish linguists, who could give an error percentage and remarks for improvement).
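For the shuffled sample step, a minimal sketch could look like this (the sample size and the fixed seed are arbitrary choices of mine, not anything agreed in this thread):

```python
import random

def shuffled_sample(sentences, n=400, seed=42):
    """Pick up to n random sentences for manual review.

    A fixed seed keeps the sample reproducible, so reviewers can be
    pointed at the same set of sentences."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(n, len(sentences)))
```

The reviewers would then only need the resulting text file, one sentence per line, to estimate the error rate.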

Any comments and ideas welcome and let’s do it!

Great to see this effort. Happy to help on a technical level if you need it.

Sorry for being away from the topic for some time. I have prepared blacklist files based on word-usage frequency and the number of hits each word gets in the national corpus. Next, I wanted to generate a sample of 200-400 sentences to check the error rate, but I noticed that @Scarfmonster filed an issue in rust-punkt (see here: https://github.com/ferristseng/rust-punkt/issues/16) which probably affects the scraper. I am not sure what the best way to proceed is. Any thoughts? I am not really familiar with NLP, and I doubt I will have much time to get acquainted with it enough to help with a fix.

Yes, we have the same issue in the scraper :frowning:

In that case, I think it would be better to wait for the fix before actually evaluating the sample. I can try to refine the rules further while waiting. I would like to avoid being misled by sentence-segmentation errors. What do you think? Do you have any information on when a fix might land?

I wouldn't block on that. Given that rust-punkt is by now basically unmaintained, I wouldn't keep my hopes up. There is https://github.com/Common-Voice/cv-sentence-extractor/issues/11 to look into alternatives, though.