I forked the git repo provided by @Scarfmonster and tried the scraping. It did pretty well for a first try. I removed ( and ) from the allowed symbols because many sentences seemed broken by them (e.g. extra details in the middle of a sentence enclosed in parentheses). I will run the blacklist operations overnight and put the word statistics in the forked repo if the file is of a reasonable size. @Scarfmonster, should we open a separate topic for work on the Polish extractor settings? I also noticed this topic, for future reference: Using the Europarl Dataset with sentences from speeches from the European Parliament - Common Voice - Mozilla Discourse.
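To illustrate the idea, here is a rough Python sketch. It is not the extractor's own code or configuration. The `ALLOWED` pattern, the helper names, and the sample sentences are all made up for the example. It rejects any sentence containing a character outside the allow-list (so anything with parentheses is dropped) and then counts word frequencies, which is the kind of statistic a blacklist of rare words could be built from.

```python
import re
from collections import Counter

# Hypothetical allow-list: Latin and Polish letters, spaces, basic punctuation.
# "(" and ")" are deliberately left out, so sentences containing them are rejected.
ALLOWED = re.compile(r"^[a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ ,.?!'\-]+$")


def keep_sentence(sentence):
    """Return True if every character of the sentence is in the allow-list."""
    return bool(ALLOWED.match(sentence))


def word_counts(sentences):
    """Count lowercase word occurrences across all given sentences."""
    counts = Counter()
    for s in sentences:
        counts.update(w.lower() for w in re.findall(r"\w+", s))
    return counts


# Made-up sample sentences, for illustration only.
sample = [
    "To jest proste zdanie.",
    "Miasto (stolica regionu) leży nad rzeką.",
    "To jest drugie zdanie.",
]

kept = [s for s in sample if keep_sentence(s)]
counts = word_counts(kept)

# Words seen only once are candidates for a rare-word blacklist.
rare = sorted(w for w, n in counts.items() if n == 1)
```

With this sample, the sentence containing parentheses is filtered out before counting, so its words never reach the statistics.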
jakub.wrobel7
(Jakub Wrobel7)