Polish dataset from Europarl - help needed

I read “Using the Europarl Dataset with sentences from speeches from the European Parliament” and tried to extract the Polish sentences from the dataset.

I based my script on the existing ones, but I tried not to lose too much, which means:

  • remove the content of brackets, but not the whole sentence;
  • if a line contains more than one sentence, try to split them;
  • a sentence doesn’t need to start with a capital letter.

The rest of the rules are very similar to those in the other scripts (removing personal names where possible, removing abbreviations, etc.), with one exception: I didn’t remove one-word sentences. The full script is available here (a mix of Python and shell). After the automatic extraction, I did a cursory manual review (removing more personal names, sentences that were too similar, a few opinions that were probably too strong without context, etc.). As a result, from about 630k lines of the parallel PL-EN corpus, I extracted 205k sentences. The full dataset is here
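
For anyone curious how the three rules listed above might look in code, here is a minimal Python sketch. It is not the actual extraction script: the names `clean_line`, `BRACKETS` and `SENTENCE_END` are hypothetical, and it assumes that brackets are not nested and that abbreviations have already been removed (otherwise splitting on a full stop would cut sentences in the wrong place).

```python
import re

# Assumption: "brackets" means (...) and [...], without nesting.
# The leading \s* also eats the space in front of the bracket.
BRACKETS = re.compile(r"\s*[\(\[][^\)\]]*[\)\]]")

# Assumption: a sentence ends at ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def clean_line(line):
    """Return the sentences found in one corpus line.

    - bracket content is removed, the rest of the sentence is kept
    - a line with several sentences is split
    - sentences starting with a lowercase letter are kept
    - one-word sentences are kept
    """
    text = BRACKETS.sub("", line)
    # Collapse any whitespace left behind by the removal.
    text = re.sub(r"\s+", " ", text).strip()
    if not text:
        return []
    return [s for s in SENTENCE_END.split(text) if s]


if __name__ == "__main__":
    for sentence in clean_line("Dziękuję (oklaski) bardzo. to jest drugie zdanie. Tak."):
        print(sentence)
```

Returning a list per line keeps the splitting step separate from later filters (names, abbreviations, near-duplicates), which can then work sentence by sentence.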

Now I need help with QA. Are there any Polish speakers here who would like to check a test sample? I prepared a sheet with 4100 random sentences for review. I’ll do a first review myself, but more reviews are needed before I can open a PR.
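
Drawing such a review sample can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `draw_review_sample`, the fixed seed and the de-duplication step are hypothetical and not taken from the actual script.

```python
import random


def draw_review_sample(sentences, sample_size=4100, seed=0):
    """Pick a reproducible random sample of sentences for manual review."""
    # Assumption: duplicates are dropped first so that no sentence is
    # reviewed twice; sorting makes the result independent of input order.
    unique = sorted(set(sentences))
    rng = random.Random(seed)  # fixed seed, so the sample can be re-created
    return rng.sample(unique, min(sample_size, len(unique)))


if __name__ == "__main__":
    # Stand-in data; the real input would be the extracted sentences.
    corpus = [f"Zdanie numer {i}." for i in range(205000)]
    sample = draw_review_sample(corpus)
    print(len(sample))
```

A fixed seed means the same sample can be regenerated later if the sheet needs to be rebuilt.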

If there are no willing Polish volunteers, should/could I upload those sentences to the Sentence Collector, so that they are slowly reviewed case by case but not lost?

I’d suggest opening a PR now and marking it as WIP; if nothing else, it should bring a bit more visibility to the extraction.

Thanks, I was thinking about opening it after QA, but you are right. PR: https://github.com/mozilla/common-voice/pull/2933

I finished my review and it looks OK: I accepted 98% of the sample sentences.

Since I see there are no active Polish contributors here, I would like to kindly ask some earlier contributors for help. So:

@jakub.wrobel7 @Scarfmonster @Etua @madziszyn @aiteam @Tomasz_Zietkiewicz

Sorry for the unexpected mention. I see you were previously active in Polish-related threads. If any of you are still interested in contributing to Common Voice, I would like to ask for help in reviewing a large Polish dataset. tl;dr: there are 4100 sample sentences (out of 200k) that need to be reviewed by 2-3 people before the whole dataset can be added to CV. It’s just a click in a spreadsheet I prepared. More details are in the first post of this thread. Thanks in advance, and sorry again for the mention.

Hi, I will try next week but no promises :wink:

Thanks a lot! And no pressure :slight_smile:

Ok, I’ll try to help, but I can’t promise anything…

Thanks! I’ll be grateful for any help :slight_smile:

Hi, @madziszyn & @jakub.wrobel7 - I’d just like to remind you about this topic (but no pressure :wink: )

@kam193 I started reviewing and will try to continue bit by bit. Should I put my name in the column like you did? Also, some sentences seem a bit awkward because of the split; I mark them with D, right?

Yes, that’s right. I think the name in the column is useful as proof that I didn’t just check it three times myself :slight_smile:

Thanks a lot for reviewing! I see that you found a few mistakes I missed. About the split sentences: I mostly didn’t mark them because I’m not sure whether they are a problem (I hope most of them are still good enough). But feel free to mark every sentence you find awkward as D; it makes sense to me, and it is why we need more than one opinion :wink:

One thing I want to mention is a sentence like this one (you marked it as B):
Maroko nie ma w Saharze Zachodniej żadnej suwerenności, powtarzam

I think it’s correct if we put it in a context like:
Maroko nie ma w Saharze Zachodniej żadnej suwerenności, powtarzam! - powiedział XY

But here we don’t have any context, so I think any sentence that could make sense in some context should be OK.

You may be right about this sentence, but for now I will try to take a conservative approach and maybe revise it later. I am slowly making progress in my scarce free time :wink:

