I read "Using the Europarl Dataset with sentences from speeches from the European Parliament" and tried to extract the Polish sentences from the dataset.
I based my script on the other scripts, but I tried not to lose too much, which means:
- remove the content of brackets, but not the whole sentence;
- if a line contains more than one sentence, try to split it;
- a sentence doesn't need to start with a capital letter.
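The three rules above could be sketched roughly as follows. This is only an illustration, not the code from the actual script: the function names and both regular expressions are assumptions, and the naive splitting on `.`, `!` and `?` would also break on abbreviations, so it only makes sense if lines with abbreviations are filtered out separately.

```python
import re

# Assumed patterns for illustration; the real script may use different ones.
# Matches one non-nested (...) or [...] group plus any whitespace before it.
BRACKETS = re.compile(r"\s*[\(\[][^()\[\]]*[\)\]]")
# Split point: whitespace that follows a sentence-ending mark.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def strip_brackets(line: str) -> str:
    """Drop bracketed content but keep the rest of the sentence."""
    without = BRACKETS.sub("", line)
    return re.sub(r"\s{2,}", " ", without).strip()


def split_sentences(line: str) -> list[str]:
    """Split after ., ! or ?; the next sentence may start with a lowercase letter."""
    return [part.strip() for part in SENTENCE_END.split(line) if part.strip()]


def extract(line: str) -> list[str]:
    """Apply both rules to one corpus line and return the resulting sentences."""
    return split_sentences(strip_brackets(line))
```

For example, `extract("Dziękuję (oklaski) bardzo. to jest test.")` gives two sentences, with the bracketed remark removed and the lowercase second sentence kept.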
The rest of the rules are very similar to the others (removing personal names where possible, removing abbreviations, etc.), with one exception: I didn't remove one-word sentences. The full script is available here (a mix of Python and shell). After the automatic extraction, I did a cursory manual review (removing more personal names, sentences that were too similar to each other, and a few opinions that are probably too strong without context). As a result, from about 630k lines of the parallel PL-EN corpus, I extracted 205k sentences. The full dataset is here.
Now I need help with QA. Are there any Polish speakers here who would like to check a test sample? I prepared a sheet with 4100 random sentences for review. I'll do a first review myself, but more reviews are needed before I can open a PR.
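Drawing a review sample of that size could look roughly like this. It is a sketch under assumptions: the function name, the fixed seed, and holding all sentences in an in-memory list are illustrative choices, not taken from the actual script.

```python
import random


def sample_for_review(sentences, k=4100, seed=0):
    """Return k distinct sentences chosen uniformly at random.

    If fewer than k sentences are available, all of them are returned
    (in random order). A fixed seed makes the sample reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(sentences, min(k, len(sentences)))
```

Using a fixed seed means the same sheet can be regenerated later, for example if reviewers ask where a given sentence came from.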
If no Polish volunteers are willing, should/could I upload those sentences to Sentence Collector, so that they are reviewed slowly, one by one, but not lost?