Here are my thoughts:
- The more technical details we can abstract away, the easier it is for somebody to use it
- Validation currently happens in a spreadsheet - this could be improved with a common, guided process
- We really need to fix the issue that it doesn’t work for quite a few languages we’d eventually want to support
Picking up older ideas and parts of what @ftyers told me, I’ve created the following diagram:
What this would allow us to do:
- Easy configuration of rules via a GUI, with a preview of how the rules apply to a sample set of sentences, without having to run a lot of tools locally
- Making sure segmentation works for a given language - though with more technical effort needed (not necessarily by the same person as configuring the rules)
- Guided review process to keep validation easy and high quality
- Guided submission once validation is done (I’m not super happy with still needing a GitHub account in that process)
- Once the PR is merged, the same process as today kicks in
I’m a bit torn on the amount of work this would need to get to the finish line. Is it worth it given that we currently mostly have Wikipedia as a source?
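To make the rule preview idea a bit more concrete, here is a minimal Python sketch of what the GUI could run behind the scenes. The rule names (`min_words`, `max_words`, `disallowed_symbols`) and the `Rules`, `check` and `preview` helpers are made up for illustration and are not the rule format of the existing tooling. Splitting words on whitespace is also an assumption that won't hold for every language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rules:
    # Hypothetical rule names, chosen only for this sketch.
    min_words: int = 1
    max_words: int = 14
    disallowed_symbols: frozenset = frozenset()


def check(sentence, rules):
    """Return the reasons a sentence is rejected; an empty list means accepted."""
    reasons = []
    # Assumption: words are separated by whitespace.
    word_count = len(sentence.split())
    if word_count < rules.min_words:
        reasons.append("too few words")
    if word_count > rules.max_words:
        reasons.append("too many words")
    bad = sorted(set(sentence) & rules.disallowed_symbols)
    if bad:
        reasons.append("disallowed symbols: " + " ".join(bad))
    return reasons


def preview(sentences, rules):
    """Pair every sample sentence with its rejection reasons."""
    return [(s, check(s, rules)) for s in sentences]


if __name__ == "__main__":
    rules = Rules(min_words=2, max_words=5, disallowed_symbols=frozenset("@#"))
    samples = ["Hello there.", "Hi", "Mail me @ home now please ok"]
    for sentence, reasons in preview(samples, rules):
        status = "OK" if not reasons else "REJECTED (" + "; ".join(reasons) + ")"
        print(f"{status}: {sentence}")
```

Showing the rejection reason next to each sample sentence is the part I'd care about most, since it lets somebody tweak a rule and immediately see what it filters out.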
Looking forward to hearing other ideas from all of you!