Polish dataset from Europarl - help needed

kam193 · November 15, 2020, 2:58pm

I read “Using the Europarl Dataset with sentences from speeches from the European Parliament” and tried to extract the Polish sentences from the dataset.

I based my script on the other extraction scripts, but I tried not to lose too much, which means:

remove the content of brackets, but not the whole sentence;
if a line contains more than one sentence, try to split it;
a sentence doesn’t need to start with a capital letter.
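
To make the three rules above concrete, here is a minimal Python sketch. It is not taken from the linked script: the helper names and the regular expressions are assumptions for illustration only, and the naive splitter will also break on abbreviations that end with a period.

```python
import re
from typing import List

# Hypothetical patterns for illustration; the real script may differ.
# A bracketed fragment such as "(oklaski)" or "[...]", plus leading whitespace.
BRACKETS = re.compile(r"\s*[\(\[][^\)\]]*[\)\]]")
# Whitespace that directly follows sentence-ending punctuation.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def strip_brackets(line: str) -> str:
    """Drop bracketed fragments but keep the rest of the line."""
    cleaned = BRACKETS.sub("", line)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


def split_sentences(line: str) -> List[str]:
    """Naively split a line after '.', '!' or '?'."""
    return [part.strip() for part in SENTENCE_END.split(line) if part.strip()]


def extract(line: str) -> List[str]:
    """Apply both rules; sentences starting in lower case are kept."""
    return split_sentences(strip_brackets(line))
```

For example, `extract("Dziękuję (oklaski) bardzo. to jest test!")` keeps both sentences, including the second one that starts with a lower-case letter.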

The rest of the rules are very similar to those in the other scripts (removing personal names where possible, removing abbreviations, etc.), with one exception: I didn’t remove one-word sentences. The full script is available here (a mix of Python and shell). After the automatic extraction, I did a cursory manual review (removing more personal names, overly similar sentences, a few opinions that are probably too strong without context, etc.). As a result, from about 630k lines of the parallel PL-EN corpus, I extracted 205k sentences. The full dataset is here

Now I need help with QA. Are there any Polish speakers here who would like to check a test sample? I prepared a sheet with 4100 random sentences for review. I’ll do a first review myself, but more reviews are needed before I can open a PR.
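
A review sample like this one can be drawn reproducibly with a fixed random seed. The sketch below is only an illustration: the function name, the default seed, and the assumption that the sentences are held in a list are mine, not taken from the original script.

```python
import random
from typing import List


def draw_review_sample(sentences: List[str],
                       sample_size: int = 4100,
                       seed: int = 0) -> List[str]:
    """Return `sample_size` distinct sentences chosen at random.

    A fixed seed means every reviewer can regenerate the same sample.
    """
    if sample_size >= len(sentences):
        # Nothing to sample from; review everything.
        return list(sentences)
    rng = random.Random(seed)
    return rng.sample(sentences, sample_size)
```

With 205k extracted sentences, a sample of 4100 covers 2% of the dataset.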

If there are no willing Polish volunteers, should/could I upload those sentences to the Sentence Collector, so that they are reviewed slowly, one by one, but not lost?

Adrijaned · November 15, 2020, 7:25pm

I’d suggest opening a PR now and marking it as WIP; if nothing else, it should bring a bit more visibility to the extraction.

kam193 · November 15, 2020, 7:44pm

Thanks, I was thinking of opening it after QA, but you’re right. PR: https://github.com/mozilla/common-voice/pull/2933

kam193 · November 21, 2020, 9:30pm

I finished my review and it looks OK: I accepted 98% of the sample sentences.

Since I see there are no active Polish contributors here, I would kindly ask some previous contributors for help. So:

@jakub.wrobel7 @Scarfmonster @Etua @madziszyn @aiteam @Tomasz_Zietkiewicz

Sorry for the unexpected mention. I see you were previously active in Polish-related threads. If any of you are still interested in contributing to Common Voice, I would like to ask for help reviewing a large Polish dataset. tl;dr: there are 4100 sample sentences (out of 200k) that need to be reviewed by 2-3 people in order to add the whole dataset to CV. It’s just a click in a spreadsheet I prepared. More details are in the first post in this thread. Thanks in advance, and sorry again for the mentions.

jakub.wrobel7 · November 22, 2020, 9:02am

Hi, I will try next week but no promises

kam193 · November 22, 2020, 10:59am

Thanks a lot! And no pressure

madziszyn · November 23, 2020, 2:40pm

OK, I’ll try to help, but I can’t promise anything…

kam193 · November 23, 2020, 4:32pm

Thanks! I’ll be happy with any help

kam193 · December 3, 2020, 8:04pm

Hi, @madziszyn & @jakub.wrobel7 - I’d just like to remind you about this topic (but no pressure)

jakub.wrobel7 · December 9, 2020, 9:53pm

@kam193 I started reviewing and will try to continue bit by bit. Should I put my name in the column like you did? Also, some sentences seem a bit awkward because of the split. I mark them with D, right?

kam193 · December 9, 2020, 10:09pm

Yes, that’s right. I think the name in the column is useful as proof that I didn’t just check everything three times myself

Thanks a lot for reviewing! I see that you found a few mistakes I missed. About the split sentences: I mostly didn’t mark them because I’m not sure it’s a problem (I hope most of them are still good enough). But feel free to mark every sentence you find awkward as D; it makes sense to me, and it is why we need more than one opinion

One thing I want to mention is a sentence like this one (you marked it as B):
Maroko nie ma w Saharze Zachodniej żadnej suwerenności, powtarzam

I think it’s correct if we put it in a context like:
Maroko nie ma w Saharze Zachodniej żadnej suwerenności, powtarzam! - powiedział XY

But here we don’t have any context, so I think any sentence that could make sense in some context should be OK.

jakub.wrobel7 · January 17, 2021, 8:48pm

You may be right about this sentence, but for now I will take a conservative approach and maybe revise it later. I am slowly making progress in my scarce free time

jakub.wrobel7 · July 16, 2021, 12:20pm

@kam193 I managed to look through all of them. It seems I marked a few more in red, but not that many. Overall I feel their quality is pretty decent. Let me know if I can help more. For the next two weeks I may be more available for all this.

kam193 · July 17, 2021, 1:54pm

@jakub.wrobel7 Thank you for your help! It’s fantastic that your score is not far from mine. I think that’s all; the Common Voice team even decided to merge the dataset into CV some time ago, so it’s already in use

stergro · July 17, 2021, 2:01pm

Hey, the Polish Europarl dataset has already been reviewed and was merged in April: