Polish dataset from Europarl - help needed

I read "Using the Europarl Dataset with sentences from speeches from the European Parliament" and tried to extract the Polish sentences from the dataset.

I based my work on the other scripts, but I tried not to lose too much, which means:

  • remove the content of brackets, but not the whole sentence;
  • if a line contains more than one sentence, try to split them;
  • a sentence doesn’t need to start with a capital letter.
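For illustration, the three rules above could be sketched in Python roughly like this. This is not the actual script linked in this post; the helper names and regular expressions are assumptions made up for the example, and the sentence splitter is deliberately naive (it would wrongly split after abbreviations, which a real pipeline has to handle separately).

```python
from __future__ import annotations

import re

# Hypothetical patterns for this sketch only.
# A bracketed fragment, together with any whitespace right before it.
BRACKETS = re.compile(r"\s*(\([^()]*\)|\[[^\[\]]*\])")
# A sentence boundary: whitespace that follows '.', '!' or '?'.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def strip_brackets(line: str) -> str:
    """Remove bracketed content but keep the rest of the sentence."""
    without = BRACKETS.sub("", line)
    return re.sub(r"\s{2,}", " ", without).strip()


def split_sentences(line: str) -> list[str]:
    """Naively split a line into sentences; no capital letter is required."""
    return [part.strip() for part in SENTENCE_END.split(line) if part.strip()]


def extract(line: str) -> list[str]:
    """Apply both rules to one corpus line."""
    return split_sentences(strip_brackets(line))


if __name__ == "__main__":
    for sentence in extract("Dziękuję (oklaski) bardzo. to jest ważne!"):
        print(sentence)
```

Because the split only looks at end punctuation followed by whitespace, a sentence starting with a lowercase letter is kept as its own sentence rather than glued to the previous one.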

The rest of the rules are very similar to the others (like removing personal names where possible, removing abbreviations, etc.), with one exception: I didn’t remove one-word sentences. The full script is available here (a mix of Python and shell). After the automatic extraction, I did a cursory manual review (removing more personal names, sentences that were too similar, a few opinions that were probably too strong without context, etc.). As a result, from about 630k lines of the parallel PL-EN corpus, I extracted 205k sentences. The full dataset is here.

Now I need help with QA. Are there any Polish speakers here who would like to check a test sample? I prepared a sheet with 4100 random sentences for review. I’ll do a first review myself, but more reviews are needed before I can open a PR.
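For anyone who wants to draw a comparable review sample from their own extraction, a minimal sketch could look like the following. How the sheet for this dataset was actually generated is not described here, so the function name, the fixed seed and the default size are assumptions for the example only.

```python
import random


def sample_for_review(sentences, k=4100, seed=0):
    """Pick k distinct sentences uniformly at random.

    A fixed seed (an assumption of this sketch) makes the sample
    reproducible, so reviewers can regenerate the same sheet.
    """
    rng = random.Random(seed)
    # Never ask for more items than exist.
    return rng.sample(sentences, min(k, len(sentences)))
```

With roughly 205k sentences, a sample of 4100 corresponds to about 2% of the dataset.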

If there are no willing Polish volunteers, should/could I upload those sentences to the Sentence Collector, so that they are slowly reviewed one by one rather than lost?

I’d suggest opening a PR now and marking it as WIP. If nothing else, it should bring a bit more visibility to the extraction.

Thanks, I was thinking of opening it after QA, but you are right. PR: https://github.com/mozilla/common-voice/pull/2933

I finished my review and it looks OK: I accepted 98% of the sample sentences.

Since I see there are no active Polish contributors here, I would kindly ask some previous ones for help. So:

@jakub.wrobel7 @Scarfmonster @Etua @madziszyn @aiteam @Tomasz_Zietkiewicz

Sorry for the unexpected mention. I see you were previously active in Polish-related threads. If any of you is still interested in contributing to Common Voice, I would like to ask for help reviewing a large Polish dataset. tl;dr: there are 4100 sample sentences (out of 200k) that need to be reviewed by 2-3 people in order to add the whole dataset to CV. It’s just a click in a spreadsheet I prepared. More details are in the first post of this thread. Thanks in advance, and sorry again for the mention.

Hi, I will try next week but no promises :wink:

Thanks a lot! And no pressure :slight_smile:

OK, I’ll try to help, but I can’t promise anything…

Thanks! I’ll be grateful for any help :slight_smile:

Hi @madziszyn & @jakub.wrobel7, I’d just like to remind you about this topic (but no pressure :wink:)