Polish sentences concerns

jakub.wrobel7 · January 29, 2020, 10:24pm

I forked the git repo provided by @Scarfmonster and tried the scraping. It did pretty well for a first try. I removed ( and ) from the allowed symbols, as many sentences seemed to be broken by them (for example, some detail in the middle of a sentence surrounded by parentheses). I will run the blacklist operations overnight and will put the word statistics in the forked repo if the file is of reasonable size. @Scarfmonster, should we make a separate topic for work on the Polish extractor settings? I also noticed this topic: Using the Europarl Dataset with sentences from speeches from the European Parliament - Common Voice - Mozilla Discourse - for future reference.
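
For illustration, the two steps mentioned above (dropping sentences that contain parentheses, then collecting word statistics for a blacklist) could look roughly like this in Python. This is only a hypothetical sketch, not the extractor's actual code or configuration: the allowed-character set, the function names, and the idea that the blacklist is built from words seen fewer than `min_count` times are all assumptions.

```python
import re
from collections import Counter

# Hypothetical allowed-character set: Latin and Polish letters, space and
# basic punctuation. Parentheses are deliberately left out.
ALLOWED = re.compile(r"[A-Za-zĄĆĘŁŃÓŚŹŻąćęłńóśźż .,?!'\-]+")


def is_allowed(sentence: str) -> bool:
    """Return True if every character of the sentence is in the allowed set."""
    return ALLOWED.fullmatch(sentence) is not None


def word_counts(sentences):
    """Count how often each lowercased word occurs across the sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"\w+", sentence.lower()))
    return counts


def rare_words(counts, min_count):
    """Return words seen fewer than min_count times (blacklist candidates)."""
    return sorted(word for word, n in counts.items() if n < min_count)
```

With something like this, a sentence such as "Ala (moja siostra) ma kota." is rejected because of the parentheses, and the output of `rare_words` can be reviewed by hand before anything is actually blacklisted.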
