Common rule files for Sentence Collector / Sentence Extractor

mkohler · October 1, 2022, 1:47pm

Hi everyone

I’m splitting out a discussion that started in Sentence Collector - Cleanup before export vs. cleanup on upload. This is about the current state of having several validation/cleanup rule files in the different parts of Common Voice. Having different rule files leads to duplicated work and confusion about how everything fits together. Therefore the proposal is to have one common set of rules per language which can be applied everywhere, reducing complexity and the amount of work needed. In this thread we want to explore drawbacks and answer the questions that would need to be answered if we want to follow that path as an end goal.

To recap

Sentence Collector uses validation (to reject sentences) and cleanup (when exporting to the Common Voice repository). In the future, cleanup might or might not happen before saving to the database; that discussion is out of scope for this thread and is handled here.
Sentence Extractor has its own rules, which dictate whether a sentence from a source gets used or not. There is no cleanup. If a sentence does not fulfill the rules, it will not be picked and the script tries to find another sentence that fulfills the rules.
CorporaCreator has some cleanup rules that get applied when creating the dataset.
Bulk Submissions do not have any tooling around validation or cleanup.

Terminology

Overall, I would suggest using the following terminology for now, to make sure we’re talking about the same thing:

Validation: the sentence gets analyzed and is either accepted or rejected. If it gets rejected, it will not end up in the database for further processing.
Cleanup: sentences that passed initial validation might still not be how we want them to be. However, there are cases where rejecting them outright is not the best experience. Therefore we can clean the sentence and still use it, without rejecting it and having the contributor re-submit it.

In essence: Validation is a hard “no” and the sentence doesn’t get used. Cleanup is used to fix smaller issues that can be fixed automatically and the contributor doesn’t need to care about this as it’s not impacting them.
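
To make the distinction concrete, here is a minimal Python sketch. The two validation rules (no digits, at most 14 words), the whitespace cleanup and all function names are made up for illustration; they are not taken from any of the existing tools.

```python
import re
from typing import Optional

MAX_WORDS = 14  # hypothetical limit, for illustration only


def validate(sentence: str) -> bool:
    """Validation: a hard yes/no. A rejected sentence is not used at all."""
    if re.search(r"[0-9]", sentence):
        return False  # hypothetical rule: no digits
    return len(sentence.split()) <= MAX_WORDS


def cleanup(sentence: str) -> str:
    """Cleanup: fix a small issue automatically and keep the sentence."""
    return re.sub(r"\s+", " ", sentence).strip()  # collapse whitespace


def process(sentence: str) -> Optional[str]:
    """Return the cleaned sentence, or None if validation rejects it."""
    if not validate(sentence):
        return None
    return cleanup(sentence)
```

With this split, `process("  Hello   world. ")` returns the cleaned sentence without bothering the contributor, while `process("I have 2 cats.")` returns `None` and the contributor would be asked to change the sentence.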

Open questions

Below I’m listing questions that will need to be answered before we can work on the approach of having one rule set for everything. I’ve split them into general questions and questions regarding specific tools. Some of these are process questions, while others are purely technical.

General:

What are the drawbacks of having one rules file for everything?
How can we make sure that we keep flexibility so that we do not necessarily need to apply all rules for every tool? See for example below for Sentence Extractor.
How do we migrate existing rules? How do we decide which Sentence Extractor rules should be used for validation in Sentence Collector, and which ones should rather be cleanups?
What data format would work for all purposes? Currently Sentence Extractor uses TOML, Sentence Collector uses JS objects, and the CorporaCreator rules live inside the Python scripts.
How can we keep the infrastructure around this as lean as possible?
How can we make sure that changes to the rules files in PRs do not have a negative impact on one of the tools?
How do we set this up so that updates don’t require manual deployment of all tools involved?
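
One possible direction for the flexibility and format questions, purely as a sketch: each rule carries its kind (validation or cleanup) and the set of tools it applies to, so every tool picks only the rules meant for it. The `Rule` fields, the tool names and the two example rules are assumptions for illustration, not an existing format.

```python
import re
from dataclasses import dataclass

ALL_TOOLS = frozenset({"collector", "extractor", "bulk"})  # hypothetical tool names


@dataclass(frozen=True)
class Rule:
    name: str
    kind: str               # "validation" or "cleanup"
    pattern: str            # regular expression
    replacement: str = ""   # only used by cleanup rules
    tools: frozenset = ALL_TOOLS


# Hypothetical rules for one language
RULES = [
    Rule("no_digits", "validation", r"[0-9]"),
    Rule("collapse_spaces", "cleanup", r" {2,}", " ",
         tools=frozenset({"collector", "bulk"})),
]


def rules_for(tool: str, kind: str) -> list:
    """Select the rules of one kind that apply to the given tool."""
    return [r for r in RULES if tool in r.tools and r.kind == kind]


def is_valid(sentence: str, tool: str) -> bool:
    """A sentence is valid if no validation rule for this tool matches."""
    return not any(re.search(r.pattern, sentence)
                   for r in rules_for(tool, "validation"))


def clean(sentence: str, tool: str) -> str:
    """Apply every cleanup rule that is enabled for this tool."""
    for r in rules_for(tool, "cleanup"):
        sentence = re.sub(r.pattern, r.replacement, sentence)
    return sentence
```

In this sketch the extractor would skip the cleanup rule entirely, while the collector and bulk tooling would apply it. The same structure could be serialized to TOML or JSON, but that choice is exactly one of the open questions above.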

Sentence Collector:

Sentence Collector requires an error message per validation rule, as it will be displayed to the contributor so they know why a given sentence was not accepted. How do we incorporate this? Should every rule have an error message? How do we migrate existing rules from Sentence Extractor so that they will include an appropriate error message?
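
As a rough illustration of how this could look, the sketch below attaches an optional error message to each validation rule and falls back to a generic message for rules that were migrated without one. The rule patterns, the message texts and the names `ValidationRule`, `DEFAULT_ERROR` and `validation_errors` are all hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class ValidationRule(NamedTuple):
    pattern: str                 # regular expression that marks a sentence as invalid
    error: Optional[str] = None  # message shown to the contributor, if any


# Generic fallback for rules that have no message yet (hypothetical wording)
DEFAULT_ERROR = "Sentence does not match the rules for this language"

RULES = [
    ValidationRule(r"[0-9]", "Sentence should not contain numbers"),
    ValidationRule(r"[<>]"),  # e.g. a migrated rule without a message
]


def validation_errors(sentence: str) -> List[str]:
    """Return one message per matching rule; an empty list means 'accepted'."""
    return [r.error or DEFAULT_ERROR
            for r in RULES
            if re.search(r.pattern, sentence)]
```

A fallback like this would let rules be migrated first and given proper messages later, at the cost of less helpful feedback in the meantime.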

Sentence Extractor:

Do we want to do cleanup?
Where do we draw the line between “this is validation, let’s not take this sentence” and “this sentence is good enough, let’s clean it up” if we do?
Should this be applied differently depending on the source? For example, for Wikipedia I’d probably not apply any cleanup. However, if the data source is a raw text file where we are allowed to pick 100% of sentences, applying cleanup would probably make sense.

CorporaCreator:

Are the CorporaCreator rules still meant to be extended, or are they legacy?
If they are not legacy, would they become legacy if we added validation processes to the Bulk Submissions?

Note that I did not look closer into CorporaCreator and do not have any previous experience with it.

Bulk Submission:

Do we want to enforce rules?
If so, how do we do that?
How do we validate that all rules were applied correctly when submitting a PR?
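
One way the last question could be approached, sketched under assumptions: an automated check runs over the submitted sentence file and reports every line that would be rejected by validation or changed by cleanup. The two rules and the function name `check_lines` are made up for illustration; no such check exists in the tools as far as this thread states.

```python
import re
from typing import List, Tuple


def check_lines(lines: List[str]) -> List[Tuple[int, str]]:
    """Return (line number, reason) for every line that breaks a rule."""
    problems = []
    for number, line in enumerate(lines, start=1):
        sentence = line.rstrip("\n")
        if re.search(r"[0-9]", sentence):
            # hypothetical validation rule: no digits
            problems.append((number, "fails validation: contains digits"))
        elif re.sub(r"\s+", " ", sentence).strip() != sentence:
            # hypothetical cleanup rule: whitespace must already be normalized
            problems.append((number, "cleanup not applied: whitespace"))
    return problems
```

A PR check could then fail whenever the returned list is not empty, and print the line numbers so the submitter knows what to fix.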

Conclusion

It’s quite a few questions, but the more we answer the easier the work will be. Looking forward to your input, feedback and thoughts on this! Also, there are probably more questions that will surface during this discussion.

I also want to point out that I’m a volunteer, and therefore my time can be very limited. If I don’t answer, please be patient.

Michael

bozden · October 2, 2022, 12:25pm

I don’t have answers to these. At the beginning, I also thought there should be a single validation, but after digging deeper I changed my mind: having different validations can be better.

There are two important things to mention though, both IMHO:

There must be a cleaning/validation process for bulk submissions; a sample review is not sufficient.
If sentences in the corpus have already been recorded and are then processed (cleaned), this might change their meaning, intonation, etc.

In general, my idea is that a human eye should see what is done by any process, as I mentioned earlier.

Therefore an “external cleaner script” might be a good idea, for example for bulk submissions. After people have scraped a public domain resource and extracted sentences, they could run them through this script and then check the results…
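
Such a script could be as small as the following sketch, which cleans each sentence and returns only the before/after pairs that actually changed, so a human can review them. The cleanup steps (stripping HTML-like tags, collapsing whitespace) and the function names are assumptions for illustration.

```python
import re
from typing import List, Tuple


def clean(sentence: str) -> str:
    """Hypothetical cleanup: strip HTML-like tags, then normalize whitespace."""
    sentence = re.sub(r"<[^>]+>", "", sentence)
    return re.sub(r"\s+", " ", sentence).strip()


def review_report(sentences: List[str]) -> List[Tuple[str, str]]:
    """Return (original, cleaned) pairs for every sentence the script changed."""
    report = []
    for original in sentences:
        cleaned = clean(original)
        if cleaned != original:
            report.append((original, cleaned))
    return report
```

Only the changed sentences end up in the report, which keeps the manual check short even for large scraped files.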

bozden · October 2, 2022, 12:37pm

These are very interesting… Have a look:

First, the language-specific ones handle very different issues. Second, the default process does validation that should be done before adding the sentence to the database, such as numbers, HTML tag stripping, etc. At this stage, it regards possibly valid sentences.

What I thought when I first saw these was that they are meant for the existing bad text corpus from the very beginning and/or for handling remnants from bulk submissions, in order to create a clean dataset file.