I based it on the other scripts, but I tried not to lose too much, which means:
remove the content of brackets, but not the whole sentence;
if a line contains more than one sentence, try to split them;
a sentence doesn't need to start with a capital letter.
The rest of the rules are very similar to the others (like removing personal names if possible, removing abbreviations etc.), with one exception: I didn't remove one-word sentences. The full script is available here (a mix of Python and shell). After the automatic extraction, I did a cursory manual review (removing more personal names, overly similar sentences, and a few opinions that are probably too strong without context, etc.). As a result, from about 630k lines of the Parallel PL-EN corpus, I extracted 205k sentences. The full dataset is here
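To make the rules above a bit more concrete, here is a minimal Python sketch of the bracket-stripping and sentence-splitting steps. It is not the actual script: the function names (`strip_brackets`, `split_sentences`, `extract`) and the regular expressions are assumptions made for illustration, and the name and abbreviation filtering is left out entirely.

```python
import re
from typing import List

# Hypothetical patterns, not taken from the real script.
# Matches one innermost (...) or [...] group plus the whitespace before it.
BRACKETS = re.compile(r"\s*[\(\[][^()\[\]]*[\)\]]")
# Splits after sentence-ending punctuation that is followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def strip_brackets(line: str) -> str:
    """Remove bracketed content but keep the rest of the sentence."""
    previous = None
    # Repeat until nothing changes, so nested brackets are removed too.
    while previous != line:
        previous = line
        line = BRACKETS.sub("", line)
    return re.sub(r"\s{2,}", " ", line).strip()


def split_sentences(line: str) -> List[str]:
    """Split a line into sentences; one-word sentences are kept."""
    return [part.strip() for part in SENTENCE_END.split(line) if part.strip()]


def extract(line: str) -> List[str]:
    """Apply both rules; no capital-letter check is performed."""
    return split_sentences(strip_brackets(line))


if __name__ == "__main__":
    for sentence in extract("To jest test (uwaga). drugie zdanie tutaj! Tak."):
        print(sentence)
```

A naive split on punctuation like this would also break on abbreviations, which is one reason the abbreviation rule and the manual review still matter.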
Now I need help with QA. Are there any Polish speakers here who want to check a test sample? I prepared a sheet with 4100 random sentences for review. I'll do a first review, but more is needed before I can open a PR.
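For anyone who wants to draw a similar review sample from their own dataset, a rough sketch could look like this. The function name `draw_review_sample` and the fixed seed are assumptions for illustration; this is not how the sheet above was necessarily produced.

```python
import random
from typing import Iterable, List


def draw_review_sample(sentences: Iterable[str], k: int = 4100, seed: int = 0) -> List[str]:
    """Return up to k distinct sentences chosen uniformly at random.

    The seed is fixed only so that the same sample can be regenerated later.
    """
    pool = list(sentences)
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))


if __name__ == "__main__":
    # Placeholder data standing in for the extracted sentences.
    dataset = [f"zdanie {i}" for i in range(10000)]
    sample = draw_review_sample(dataset)
    print(len(sample))
```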
If there are no willing Polish volunteers, should/could I upload those sentences to Sentence Collector, so that they will be slowly reviewed one by one, but not lost?
Sorry for the unexpected mention. I see you were previously active in Polish-related threads. If any of you are still interested in contributing to Common Voice, I would like to ask for help in reviewing a large Polish dataset. tl;dr: there are 4100 sample sentences (out of 200k) that need to be reviewed by 2-3 people in order to add the whole dataset to CV. It's just a click in a spreadsheet I prepared. More details are in the first post of this thread. Thanks in advance and sorry again for the mention.
@kam193 I started reviewing and will try to continue bit by bit. Do I put my name in the column like you did? Also, some sentences seem a bit awkward because of the split; I mark them with D, right?
Yes, that's right. I think the name in the column is useful as proof that I didn't check it three times.
Thanks a lot for reviewing! I see that you found a few mistakes I missed. About split sentences: I mostly didn't mark them because I'm not sure if it's a problem (I hope most of them are still good enough). But feel free to mark every sentence you find awkward as D; it makes sense to me, and it is why we need more than one opinion.
One thing I want to mention is a sentence like this (you marked it as B): Maroko nie ma w Saharze Zachodniej żadnej suwerenności, powtarzam
I think it’s correct if we put it in a context like: Maroko nie ma w Saharze Zachodniej żadnej suwerenności, powtarzam! - powiedział XY
But here we don't have any context, so I think any sentence that could make sense should be OK.
You may be right about this sentence, but for now I will try to take a conservative approach and maybe revise it later. I am slowly making progress in my scarce free time.
@kam193 I managed to look through all of them. It seems I marked a few more in red, but not that many. Overall I feel their quality is pretty decent. Let me know if I can help more. For the next two weeks I may be more available for all this.
@jakub.wrobel7 Thank you for your help! It's fantastic that your score is not far from mine. I think that's all; the Common Voice team even decided to merge the dataset into CV some time ago, so it's already in use.