Future of the Sentence Extractor - Your input is required

Here are my thoughts:

  • The more technical details we can abstract, the easier it is for somebody to use it
  • Validation currently happens in a spreadsheet - this could be improved with a common, guided process
  • We really need to fix the fact that it doesn’t work for quite a few languages we’d eventually want to support

Picking up older ideas and parts of what @ftyers told me, I’ve created the following diagram:

What this would allow us to do:

  • Easy configuration of rules via a GUI, with a preview of how the rules apply to a sample set of sentences, without having to run a lot of tools locally
  • Making sure segmentation works for a given language - though this needs more technical effort (not necessarily by the same person who configures the rules)
  • Guided review process to keep validation easy and high quality
  • Guided submission once validation is done (I’m not super happy with still needing a GitHub account in that process)
  • Once the PR is merged, the same process as today kicks in
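
To make the rules preview idea a bit more concrete, here is a minimal sketch of what such a preview could compute behind the GUI. This is not how the Sentence Extractor works today: the `Rules` fields, the `check` and `preview` functions and the sample sentences are all made up for illustration, and a real version would presumably read the existing per-language rules instead.

```python
from dataclasses import dataclass, field


@dataclass
class Rules:
    # Hypothetical rule names, for illustration only.
    min_words: int = 1
    max_words: int = 14
    disallowed_symbols: set = field(default_factory=lambda: set("<>+*#@%^[]()/"))
    needs_uppercase_start: bool = True


def check(sentence, rules):
    """Return the reasons a sentence is rejected; an empty list means accepted."""
    reasons = []
    words = sentence.split()
    if len(words) < rules.min_words:
        reasons.append("too few words")
    if len(words) > rules.max_words:
        reasons.append("too many words")
    if any(ch in rules.disallowed_symbols for ch in sentence):
        reasons.append("disallowed symbol")
    if rules.needs_uppercase_start and not sentence[:1].isupper():
        reasons.append("no uppercase start")
    return reasons


def preview(sample, rules):
    """Pair every sample sentence with its rejection reasons."""
    return [(sentence, check(sentence, rules)) for sentence in sample]


if __name__ == "__main__":
    sample = [
        "This is fine.",
        "lowercase start here.",
        "Contains a # symbol.",
    ]
    for sentence, reasons in preview(sample, Rules()):
        status = "accepted" if not reasons else "rejected: " + ", ".join(reasons)
        print(f"{sentence!r} -> {status}")
```

The point is that whoever configures the rules would see per sentence why it was accepted or rejected, and could tweak a value and immediately see the effect on the sample.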

I’m a bit torn on the amount of work this would need to get over the finish line. Is it worth it given that we currently mostly have Wikipedia as a source?

Looking forward to hearing other ideas from all of you!
