Common Voice New Sentence Collector

Hey, everyone! We’re excited to announce that the new sentence collector is now live at https://commonvoice.allizom.org/en/write. We’d really appreciate it if you could take a few minutes to check it out and let us know whether everything is working as expected. There are some awesome new features to explore, so we hope you enjoy using them! Thanks for your support.

2 Likes

Hi Gina, thanks for sharing this! I’m very happy it has come to this, after we first talked about it possibly two or more years ago.

The link you posted points to the staging environment, so here is the production website one:

https://commonvoice.mozilla.org/en/write

I have one question: what is the plan for the validation requirements? Currently the text is static, but not all languages have the same validation rules. For example, Italian validation is based on the number of characters, yet the list would still say 15 words. I feel this could be confusing, especially if a validation fails on a requirement that isn’t even listed.

That being said, I will archive the old Sentence Collector repository later tonight and transfer any still-relevant issues to the main Common Voice repository.

Great work!

2 Likes

Are all old unreviewed sentences from the old sentence collector gone? Some languages had quite a lot of sentences that weren’t exported yet. There are probably technical reasons for that.

1 Like

I think this can be handled through a knowledgeable translation on Pontoon:

https://pontoon.mozilla.org/tr/common-voice/web/locales/en/messages.ftl/?status=missing&string=279508

Translation managers should be aware of this though… Maybe a description text would help…

1 Like

@stergro, here is the answer:

They will be imported.

1 Like

For any sufficiently simple rule file, sure. But even then I see a few issues. I’ve spent some time looking deeper into this to provide as much info as possible; I’ll admit I did not spend as much time on my initial review. This is also going to get a bit long, as I’d like it to serve as a bug report as well. Let’s also keep in mind that this is the first version of the rewrite and improvements are certainly already planned. The current behavior might lead to some frustration, but contributors might just choose another sentence to submit and then it works; I have absolutely no data on this. Also, even though it’s not fully correct, I really appreciate the requirement list, as it is a great improvement over the previous version of the Sentence Collector.

There are 8 bullet point items in that list. Arguably only 3 of them are actually dynamic (fewer than 15 words, no numbers and special characters, and no foreign letters). The others can easily be seen as valid for all languages.

Problem statement

1) Remapping by translation is not possible because there is no translatable error key

The default file has the following validations and they only partially map against the actual requirements:

  • ERR_TOO_LONG -> “Fewer than 15 words”
  • ERR_NO_NUMBERS -> “No numbers and special characters”
  • ERR_NO_SYMBOLS -> “No numbers and special characters”
  • ERR_NO_ABBREVIATIONS -> Leads to an error, but does not show which one
  • (unused in default file) -> No foreign letters

This shows the first two problems in the default validator. Even though we check for abbreviations by default, there is no corresponding requirement in the list. I’d argue that abbreviations are common enough to warrant inclusion in the main list. For the default file we could replace “No foreign letters” with “No abbreviations”. However, the ERR_NO_ABBREVIATIONS error code is not mapped to that “No foreign letters” item, so it does not yet get marked in red when the error occurs. So now we’re only half-way there: we have the requirement in the list, but we do not explicitly highlight it on error.

What we could do now is remap the error code in the validation file from ERR_NO_ABBREVIATIONS to ERR_NO_FOREIGN_SCRIPT.

This would lead us to the following working rule set:

  {
    type: 'regex',
    regex: /[A-Z]{2,}|[A-Z]+\.*[A-Z]+/,
    error: `${TRANSLATION_KEY_PREFIX}sc-validation-no-abbreviations`,
    errorType: ERR_NO_FOREIGN_SCRIPT,
  },

I would heavily argue against this, as it is purely confusing and looks like a bug. However, for the default rules file this would somewhat work out, and we could cover all cases.

More complex cases, as described further down, still would not be possible, unless you keep the OTHER category and map it to a very generic requirement string. But then it kind of loses its power.
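
To illustrate what the remapped rule actually checks, here is a quick sanity check of the abbreviation regex from the rule set above. The test strings are my own examples, not taken from the Common Voice test suite:

```typescript
// Quick sanity check of the abbreviation regex from the rule set above.
// The test strings are my own examples, not from the Common Voice tests.
const abbreviationRegex = /[A-Z]{2,}|[A-Z]+\.*[A-Z]+/;

// Flags all-caps runs ("NASA") and dotted acronyms ("U.S.")...
console.log(abbreviationRegex.test("NASA was founded in 1958")); // true
console.log(abbreviationRegex.test("The U.S. economy grew"));    // true

// ...but leaves ordinary sentences alone.
console.log(abbreviationRegex.test("Hello there, how are you?")); // false
```

This is exactly why mapping it to ERR_NO_FOREIGN_SCRIPT is confusing: the rule detects abbreviations, not foreign letters.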

2) Translators do not necessarily have technical knowledge

Not all translators using Pontoon have a technical background. Even if somebody can provide “knowledgeable” translations, I’m fairly sure that not all translators are aware of the individual rule files, or fully understand them. We should not make technical knowledge a requirement for contributing translations. The current versions of the rule files, if somebody finds the actual file on GitHub, contain spelled-out error messages (which are possibly unused). This might help, but there are still a lot of hurdles to cross first.

3) More complex validation files

The default validation file is rather simple. We currently have 20 language validation files, the default one being one of them. Out of these 20 files, 10 have an ERR_OTHER error type which is not explicitly listed as a requirement. These vary widely in what they represent, and it’s almost impossible to know exactly what failed if you happen to submit a sentence that trips one of these validations. Granted, these are likely edge cases, but it can still be very frustrating to get nothing but a generic error when submitting.
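
To make the problem concrete, here is a hypothetical sketch of why ERR_OTHER is opaque to contributors. The rule shape mirrors the snippet quoted above; the rules themselves are made up for illustration and are not taken from any real language file:

```typescript
// Hypothetical sketch: two very different checks collapsed into ERR_OTHER.
// Rule shape mirrors the snippet quoted above; the rules are invented.
const ERR_OTHER = "ERR_OTHER";

interface ValidationRule {
  type: "regex";
  regex: RegExp;
  error: string;
  errorType: string;
}

const rules: ValidationRule[] = [
  { type: "regex", regex: /\d/, error: "No numbers", errorType: "ERR_NO_NUMBERS" },
  // Two unrelated checks, both filed under the same generic bucket:
  { type: "regex", regex: /\s{2,}/, error: "No double spaces", errorType: ERR_OTHER },
  { type: "regex", regex: /[()[\]]/, error: "No brackets", errorType: ERR_OTHER },
];

function validate(sentence: string): string[] {
  // Only the errorType reaches the frontend, so both ERR_OTHER
  // failures look identical to the contributor.
  return rules.filter((r) => r.regex.test(sentence)).map((r) => r.errorType);
}

console.log(validate("A sentence with  double spaces (and brackets)"));
// Both failures come back as the same opaque "ERR_OTHER"
```

The contributor has no way to tell whether double spaces or brackets caused the rejection.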

A lot of effort was put into the Thai validation file, resulting in 23 different rules: https://github.com/common-voice/common-voice/blob/main/server/src/core/sentences/validation/languages/th.ts

4) (nit-picking) The error field seems to be unused

I think it is? I have not checked everything, but I feel we could leverage this field in a better way.

Suggestion

It’s getting late and I might very well be missing things at this point, so take the following suggestion with a grain of salt and double-check its validity. I’m not afraid of being told where I’m wrong or what I missed. After all, this should be a discussion to find a good solution together. There might be more straightforward or better solutions here.

In the old Sentence Collector, the error property defined on the rule was directly returned to the frontend and shown, but only after submitting the sentence. Before submitting you had no idea what the actual validation rules were, so the current list of requirements is certainly an improvement already. I think those two approaches could be combined.

Use error for the list

For the requirement list, we could assemble it from two sources:

  • The static content that is already there, such as spelling and grammar, or appropriate citation
  • …combined with the list of error messages from the respective language’s validation file

Note that with the new interface these would not even need to be translated; they could just be in their respective language, as the contribution interface will match that language anyway. Some of the validation files have English error messages, but that could be changed by specifically reaching out to the contributors and asking them for a correction.

For the default validation file we could still rely on the Pontoon translations as-is.
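
The assembly itself would be trivial. Here is a minimal sketch, assuming the backend can expose each rule’s error string per language; the variable names and the endpoint shape are my own guesses, not actual Common Voice code:

```typescript
// Minimal sketch of the suggested list assembly. Assumes (hypothetically)
// that the backend exposes each rule's `error` string per language.
const staticRequirements = [
  "No copyright restrictions (cc-0)",
  "Use correct grammar",
  "Use correct spelling and punctuation",
];

// What a per-language rules endpoint might return for, say, Catalan.
const languageRuleErrors = [
  "El nombre de paraules ha de ser entre 1 i 14 (inclòs)",
  "La frase no pot contenir nombres",
];

// The rendered requirement list is simply the concatenation of both parts.
const requirementList = [...staticRequirements, ...languageRuleErrors];
console.log(requirementList.length); // 5
```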

Mapping error cases and entries

To correctly identify and mark the unfulfilled requirement, we would need a unique identifier per validation rule. This could be a constant, a number, or even just the index in the array, as long as the errors and their unique identifiers can be fetched from the backend and used to populate the list. Fetching this information from the backend likely adds little overhead to general traffic, though I do not have any numbers on this. Alternatively, there are hacky ways to integrate this at build time, but we most likely want to avoid that.
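
A sketch of the identifier idea, assuming each rule gains a stable `id` (here simply the array index) that the backend returns for failed rules; the payload shape is hypothetical:

```typescript
// Sketch of per-rule identifiers. The `id` is just the array index here;
// the payload shape is hypothetical, not actual Common Voice code.
interface RuleInfo {
  id: number;
  error: string;
}

// What the backend might send to populate the requirement list.
const rules: RuleInfo[] = [
  { id: 0, error: "Fewer than 15 words" },
  { id: 1, error: "No numbers and special characters" },
];

// On a failed submission the backend returns the ids of the failed rules...
const failedIds = new Set([1]);

// ...and the frontend can highlight exactly those entries in red.
const highlighted = rules.map((r) => ({ ...r, failed: failedIds.has(r.id) }));
console.log(highlighted.filter((h) => h.failed).map((h) => h.error));
// ["No numbers and special characters"]
```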

Result

For Catalan this would result in the following list (I’m very sorry for not translating the first few items):

  • No copyright restrictions (cc-0)
  • Use correct grammar
  • Use correct spelling and punctuation
  • Include appropriate citation
  • Ideally natural and conversational (it should be easy to read the sentence)
  • El nombre de paraules ha de ser entre 1 i 14 (inclòs) — “The number of words must be between 1 and 14, inclusive”
  • La frase no pot contenir nombres — “The sentence may not contain numbers”
  • La frase no pot contenir signes de puntuació al mig — “The sentence may not contain punctuation marks in the middle”
  • La frase no pot contenir simbols o multiples espais o exclamacions — “The sentence may not contain symbols, multiple spaces, or exclamations”
  • La frase no pot contenir abreviacions o acrònims — “The sentence may not contain abbreviations or acronyms”

1 Like

Great review @mkohler.

I think part of this problem is caused by the limitations of Fluent/Pontoon, which we already discussed in the past regarding the dinosaur examples. Those were based on English language rules, and people had to adapt/localize them to the rules of their language. But there were N of them, and you had to provide exactly N adaptations, no more, no less.

One idea I can think of is to divide the rules into two groups, “common rules” and “language-specific rules”, and give the second group (say) 10 variables. The code checks whether each one is empty and only outputs it if it is not.
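
The two-group idea could be sketched like this; the fixed slot count and the slot contents are purely illustrative assumptions, not anything that exists in the codebase:

```typescript
// Sketch of the two-group idea: common rules plus (hypothetically) ten
// fixed language-specific slots that localizers fill via Pontoon.
const commonRules = [
  "Use correct grammar",
  "Use correct spelling and punctuation",
];

// Up to 10 slots; unfilled ones stay empty strings in the translation file.
const languageSpecificSlots = [
  "La frase no pot contenir nombres", // slot 1, filled for Catalan
  "", "", "", "", "", "", "", "", "", // slots 2-10, unused
];

// The code checks whether a slot is empty and only outputs it if not.
const rendered = [
  ...commonRules,
  ...languageSpecificSlots.filter((slot) => slot !== ""),
];
console.log(rendered.length); // 3
```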

Can multiple sentences be added at once?

I just want to give my sincere thanks to the team, this new collector is much better for data provenance and sourcing. Big ups :heart:

1 Like

Hi,
The new sentence collector is not reflected in the recently launched Common Voice site for Tamazight (zgh).

Thank you, we will check the issue and get back to you with a solution.

1 Like

As of right now we support only one sentence per “entry”. Submitting multiple sentences separated by newlines, as was possible in the old Sentence Collector, will be implemented in a future iteration.

Not right now (it’s one sentence per submission), but we have plans that should make submitting multiple sentences at once easier in the near future, and we have documentation for Bulk Sentence Uploads if you want to send us a lot of sentences at once!

Hi Jess,

Is there any date on the horizon for bulk sentence uploads?

I can manage the bulk upload through GitHub as well, but I was wondering whether those sentences end up directly in the sentence set or go through review. I’d prefer they went through review, as my corpus is rather large and does contain some errors that I cannot pick out manually myself.

1 Like

Hello! For now, we don’t have a landing date for the updated bulk sentence process. GitHub bulk sentence uploads will go through the review process, if you don’t mind sending them in that way!

1 Like

This link seems to be dead. It looks like the file name spelling was fixed (by you :slight_smile:), so the new link is

Thanks for the pointer!