Hi everyone,
The Sentence Extractor has been around for some time and has been used to extract sentences from Wikipedia for several languages. While this process works for some languages, it doesn’t for others. Right now I’m seeing the following issues we might want to address:
- It doesn’t work well for certain languages where rust-punkt does not segment sentences correctly, either because the language does not use periods to separate sentences or because abbreviations are not recognized correctly.
- Contributors interested in doing an extract for their language need to go through quite a few steps to get their extract incorporated, which also requires quite some technical knowledge.
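To make the first issue a bit more concrete, here is a minimal sketch of the two failure modes. It uses a deliberately naive splitter written for this post, not rust-punkt or the Sentence Extractor's actual logic; the function name `naive_split` and the sample sentences are made up for illustration.

```python
import re


def naive_split(text: str) -> list[str]:
    """Hypothetical splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Abbreviation problem: "Dr." is treated as the end of a sentence.
english = "Dr. Smith arrived late. He apologized."

# Terminator problem: Hindi ends sentences with the danda "।", not a period,
# so a period-based rule finds no boundary at all.
hindi = "यह पहला वाक्य है। यह दूसरा वाक्य है।"

print(naive_split(english))  # 3 pieces instead of 2
print(naive_split(hindi))    # 1 piece instead of 2
```

A trained segmenter handles many of these cases better than this toy rule, but the same two categories of errors are what we see for the affected languages.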
Given that there are still Wikipedias in languages that haven’t been leveraged yet, I want to start a discussion on how you would like this process to work. Additionally, there are other sources this process could be used for.
It would be great to have a discussion here around the following question:
In a perfect world, how would you expect the flow to work to extract sentences from sources like Wikipedia?
Note that in the end we will still need to run the export to make sure the legal requirements are met, but anything before that is up for improvement.