For any sufficiently simple rule files, sure. But even then I see a few issues. I’ve spent some time looking deeper into this to provide as much info as possible. I’ll admit, I did not spend as much time reviewing this initially. This is also gonna get a bit longer as I’d like this to serve as a bug report as well. Let’s also consider that this is a first version of the rewrite and improvements certainly are already planned. This might lead to some frustration, but overall contributors might just choose another sentence to submit and then it works. I have absolutely no data on this. Also, even though it’s not fully correct, I really appreciate the requirement list as that is a great improvement over the previous version of the Sentence Collector.
There are 8 bullet point items in that list. Arguably only 3 of them are actually dynamic (fewer than 15 words, no numbers and special characters, and no foreign letters). The others can easily be seen as valid for all languages.
Problem statement
1) Remapping by translation is not possible because there is no translatable error key
The default file has the following validations and they only partially map against the actual requirements:
- ERR_TOO_LONG -> “Fewer than 15 words”
- ERR_NO_NUMBERS -> “No numbers and special characters”
- ERR_NO_SYMBOLS -> “No numbers and special characters”
- ERR_NO_ABBREVIATIONS -> Leads to an error, but does not show which one
- (unused in default file) -> No foreign letters
This shows the first two problems in the default validator. Even though we check for abbreviations by default, we do not have a special requirement in the list. I’d argue that abbreviations are common enough to warrant inclusion in the main list. For the default file we could replace “No foreign letters” with “No abbreviations”. However, the ERR_NO_ABBREVIATIONS error code is not mapped to that “No foreign letters” item, so it won’t yet get marked in red if an error occurs. So basically now we’re half-way through: we have the requirement in the list, but we do not explicitly highlight it on error.
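To make the gap concrete, here is a minimal sketch of how such an error-type lookup might look. This is not the actual frontend code: the requirement keys and the `requirementToHighlight` helper are hypothetical, and the error types are modelled as plain string literals purely for illustration.

```typescript
// Error types as named in this post, modelled as string literals (assumption).
type ErrorType =
  | 'ERR_TOO_LONG'
  | 'ERR_NO_NUMBERS'
  | 'ERR_NO_SYMBOLS'
  | 'ERR_NO_ABBREVIATIONS'
  | 'ERR_NO_FOREIGN_SCRIPT'
  | 'ERR_OTHER';

// Hypothetical requirement keys; the real translation keys may differ.
const REQUIREMENT_FOR_ERROR: Partial<Record<ErrorType, string>> = {
  ERR_TOO_LONG: 'fewer-than-15-words',
  ERR_NO_NUMBERS: 'no-numbers-or-special-characters',
  ERR_NO_SYMBOLS: 'no-numbers-or-special-characters',
  ERR_NO_FOREIGN_SCRIPT: 'no-foreign-letters',
  // ERR_NO_ABBREVIATIONS and ERR_OTHER have no entry, so nothing can be highlighted.
};

// Returns the requirement to mark in red, or undefined if there is none.
function requirementToHighlight(errorType: ErrorType): string | undefined {
  return REQUIREMENT_FOR_ERROR[errorType];
}
```

With a lookup like this, a sentence failing the abbreviation rule yields `undefined`, so the user gets an error without any list item being highlighted.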
What we could do now is remap the error code in the validation file from ERR_NO_ABBREVIATIONS to ERR_NO_FOREIGN_SCRIPT.
This would lead us to the following, working rule set:
{
type: 'regex',
regex: /[A-Z]{2,}|[A-Z]+\.*[A-Z]+/,
error: `${TRANSLATION_KEY_PREFIX}sc-validation-no-abbreviations`,
errorType: ERR_NO_FOREIGN_SCRIPT,
},
I would heavily argue against this, as this is purely confusing and looks like a bug. However, for the default rules file this would somewhat work out and we could cover all cases.
More complex cases as described further down still would not be possible, except if you keep the OTHER category and map it against a very generic requirement string. But then it kinda loses its power.
2) Translators do not necessarily have technical knowledge
Not all translators using Pontoon have a technical background. Even if somebody can provide “knowledgeable” translations, I’m fairly sure that not all translators would be aware of the individual rule files, let alone fully understand them. We should not require technical knowledge to contribute translations. The current versions of the rules files, if somebody finds the actual file on GitHub, have actual spelled-out error messages (which possibly are unused). This might help, but a lot of hurdles still need to be crossed first.
3) More complex validation files
The default validation file is rather simple. We currently have 20 language validation files, the default one being one of them. Out of these 20 files, 10 have an ERR_OTHER error type which is not explicitly marked as a requirement. These range very far in what they represent, and it’s almost impossible to know exactly what failed if you happen to submit a sentence that fails that validation. Granted, these are likely edge cases, but it can still be very frustrating to not have any feedback apart from a generic error when submitting.
There was a lot of effort put into the Thai validation file, resulting in 23 different rules in the validation file: https://github.com/common-voice/common-voice/blob/main/server/src/core/sentences/validation/languages/th.ts .
4) (nit-picking) error field seems to be unused
I think it is? I have not checked everything. I feel we could leverage this in a better way.
Suggestion
It’s getting late and I might very well be missing things at this point. So take the following suggestion with a grain of salt and double check its validity. I’m not afraid of being told where I’m wrong or where I missed something. After all, this shall be a discussion to find a good solution together. There might be more straightforward or better solutions here.
In the old Sentence Collector the error property defined on the rule was directly returned to the frontend and shown. This was after submitting the sentence. Before submitting you had no idea what the actual validation rules were, so the current list of requirements is certainly an improvement already. I think those two approaches could be mixed.
Use error for the list
For the requirement list we could assemble the list from two sources:
- Static content as already there, such as spelling and grammar, or appropriate citation
- … and combine it with the list of error messages of the respective validation file for the language
Note that in the case of the new interface these would not even need to be translated: they could just be in their respective language, as the contribution interface will match that language anyway. Some of the validation files have English error messages, but that could be changed by specifically reaching out to the contributors and asking them for a correction.
For the default validation file we could still rely on the Pontoon translations as-is.
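As a rough sketch of the idea, the combined list could be assembled like this. The `ValidationRule` shape is reduced to the one field used here, and `STATIC_REQUIREMENTS` and `buildRequirementList` are hypothetical names, not existing code. Dropping duplicate messages is my own assumption, for files where several rules share one message.

```typescript
// Only the field needed for the list; real rule objects carry more (assumption).
interface ValidationRule {
  error: string;
}

// Static entries that apply to every language, taken from the current list.
const STATIC_REQUIREMENTS: string[] = [
  'No copyright restrictions (cc-0)',
  'Use correct grammar',
  'Use correct spelling and punctuation',
  'Include appropriate citation',
  'Ideally natural and conversational (it should be easy to read the sentence)',
];

// Static entries first, then each distinct error message of the language file,
// in the order the rules are defined.
function buildRequirementList(rules: ValidationRule[]): string[] {
  const seen = new Set<string>();
  const dynamic: string[] = [];
  for (const rule of rules) {
    if (!seen.has(rule.error)) {
      seen.add(rule.error);
      dynamic.push(rule.error);
    }
  }
  return [...STATIC_REQUIREMENTS, ...dynamic];
}
```

The language-specific entries stay in whatever language the validation file uses, which matches the point above that they would not need to go through translation.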
Mapping error cases and entries
To be able to correctly identify and mark the unfulfilled requirement, we would need to have unique identifiers per validation rule. This could be a constant, a number or even just the index in the array, as long as the errors as well as that unique identifier can be fetched from the backend and be used to populate the list. Fetching this information from the backend is likely not much overhead in general traffic, though I do not have any numbers on this. Alternatively, there are hacky ways to integrate this at build time, but we most likely want to avoid that.
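A minimal sketch of what that could look like, assuming each rule gets a string `id` and that a regex match means the sentence violates the rule. The `IdentifiedRule` shape, the ids, the number regex and both helper functions are hypothetical; only the abbreviation regex is taken from the default file quoted earlier.

```typescript
// Hypothetical rule shape: the existing regex and error fields plus a unique id.
interface IdentifiedRule {
  id: string;
  regex: RegExp; // assumption: a match means the sentence is invalid
  error: string; // message shown in the requirement list
}

const exampleRules: IdentifiedRule[] = [
  { id: 'no-numbers', regex: /[0-9]/, error: 'La frase no pot contenir nombres' },
  {
    id: 'no-abbreviations',
    regex: /[A-Z]{2,}|[A-Z]+\.*[A-Z]+/,
    error: 'La frase no pot contenir abreviacions o acrònims',
  },
];

// What the backend could expose so the frontend can populate the list.
function listRequirements(rules: IdentifiedRule[]): { id: string; text: string }[] {
  return rules.map((rule) => ({ id: rule.id, text: rule.error }));
}

// Ids of all rules the sentence violates; the frontend marks those entries in red.
function failedRuleIds(sentence: string, rules: IdentifiedRule[]): string[] {
  return rules.filter((rule) => rule.regex.test(sentence)).map((rule) => rule.id);
}
```

Because the list entries and the validation result share the same ids, every failing rule maps to exactly one visible requirement, and no ERR_OTHER-style catch-all is needed for highlighting.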
Result
For Catalan this would result in the following list (I’m very sorry for not translating the first few items):
- No copyright restrictions (cc-0)
- Use correct grammar
- Use correct spelling and punctuation
- Include appropriate citation
- Ideally natural and conversational (it should be easy to read the sentence)
- El nombre de paraules ha de ser entre 1 i 14 (inclòs)
- La frase no pot contenir nombres
- La frase no pot contenir signes de puntuació al mig
- La frase no pot contenir simbols o multiples espais o exclamacions
- La frase no pot contenir abreviacions o acrònims