Common Voice New Sentence Collector

gina · May 10, 2023, 1:37pm

Hey, everyone! We’re excited to announce that the new sentence collector is now live on https://commonvoice.allizom.org/en/write. We’d really appreciate it if you could take a few minutes to check it out and let us know if everything is working as expected. There are some awesome new features to explore, so we hope you enjoy using them! Thanks for your support.

mkohler · May 10, 2023, 8:16pm

Hi Gina, thanks for sharing this! I’m very happy it has come to this, after we started talking about it possibly 2 or more years ago.

The link you posted points to the staging environment, so here is the production website one:

https://commonvoice.mozilla.org/en/write

I have one question: what is the plan with the validation requirements? Currently the text is static, and not all languages have the same validation. For example, the Italian validation is based on the number of characters, but the list would still say 15 words. I feel this could be confusing, especially if a validation fails on a requirement that is not even listed.

That being said, I will archive the old Sentence Collector repository later tonight and transfer possibly still relevant issues to the main Common Voice repository.

Great work!

stergro · May 11, 2023, 7:34am

Are all old unreviewed sentences from the old sentence collector gone? Some languages had quite a lot of sentences that weren’t exported yet. There are probably technical reasons for that.

bozden · May 11, 2023, 8:59am

I think this can be handled through a knowledgeable translation on Pontoon:

Translation managers should be aware of this though… Maybe a description text would help…

bozden · May 11, 2023, 10:16am

@stergro, here is the answer:

They will be imported.

mkohler · May 11, 2023, 10:24pm

For any sufficiently simple rule files, sure. But even then I see a few issues. I’ve spent some time looking deeper into this to provide as much info as possible. I’ll admit, I did not spend as much time reviewing this initially. This is also gonna get a bit longer, as I’d like it to serve as a bug report as well. Let’s also consider that this is a first version of the rewrite and improvements are certainly already planned. The issues might lead to some frustration, but overall contributors might just choose another sentence to submit and then it works. I have absolutely no data on this. Also, even though it’s not fully correct, I really appreciate the requirement list, as that is a great improvement over the previous version of the Sentence Collector.

There are 8 bullet point items in that list. Arguably only 3 of them are actually dynamic (fewer than 15 words, no numbers and special characters, and no foreign letters). The others can easily be seen as valid for all languages.

Problem statement

1) Remapping by translation is not possible because there is no translatable error key

The default file has the following validations and they only partially map against the actual requirements:

ERR_TOO_LONG → “Fewer than 15 words”
ERR_NO_NUMBERS → “No numbers and special characters”
ERR_NO_SYMBOLS → “No numbers and special characters”
ERR_NO_ABBREVIATIONS → Leads to an error, but does not show which one
(unused in default file) → No foreign letters

This shows the first two problems in the default validator. Even though we check for abbreviations by default, we do not have a special requirement in the list. I’d argue that abbreviations are common enough to warrant inclusion in the main list. For the default file we could replace “No foreign letters” with “No abbreviations”. However the ERR_NO_ABBREVIATIONS error code is not mapped to that “No foreign letters” item, so it won’t yet get marked in red if an error occurs. So basically now we’re half-way through, we have the requirement in the list, but we do not explicitly highlight it on error.
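To make the gap more concrete, the mapping listed above can be sketched in TypeScript. This is only an illustration: the error type names come from the list above, but the `RequirementId` values, the `requirementFor` table and the `highlightedRequirement` helper are hypothetical and not the actual Common Voice code.

```typescript
// Error types named in this post; treating them as string literals is an assumption.
type ErrorType =
  | 'ERR_TOO_LONG'
  | 'ERR_NO_NUMBERS'
  | 'ERR_NO_SYMBOLS'
  | 'ERR_NO_ABBREVIATIONS'
  | 'ERR_NO_FOREIGN_SCRIPT'

// Hypothetical identifiers for the three dynamic entries of the requirement list.
type RequirementId =
  | 'fewer-than-15-words'
  | 'no-numbers-or-special-characters'
  | 'no-foreign-letters'

// Partial on purpose: ERR_NO_ABBREVIATIONS has no list entry to point at.
const requirementFor: Partial<Record<ErrorType, RequirementId>> = {
  ERR_TOO_LONG: 'fewer-than-15-words',
  ERR_NO_NUMBERS: 'no-numbers-or-special-characters',
  ERR_NO_SYMBOLS: 'no-numbers-or-special-characters',
  ERR_NO_FOREIGN_SCRIPT: 'no-foreign-letters',
}

// Returns the list entry to mark in red, or undefined if nothing can be marked.
function highlightedRequirement(errorType: ErrorType): RequirementId | undefined {
  return requirementFor[errorType]
}
```

Under these assumptions, `highlightedRequirement('ERR_NO_ABBREVIATIONS')` comes back `undefined`: the submission fails, but no entry in the list can be highlighted.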

What we could do now is remap the error code in the validation file from ERR_NO_ABBREVIATIONS to ERR_NO_FOREIGN_SCRIPT.

This would lead us to the following, working rule set:

  {
    type: 'regex',
    regex: /[A-Z]{2,}|[A-Z]+\.*[A-Z]+/,
    error: `${TRANSLATION_KEY_PREFIX}sc-validation-no-abbreviations`,
    errorType: ERR_NO_FOREIGN_SCRIPT,
  },

I would heavily argue against this, as this is purely confusing and looks like a bug. However, for the default rules file this would somewhat work out and we could cover all cases.
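To see what that remapped rule would do in practice, here is a small self-contained sketch. The regex is copied from the rule above; the reduced `Rule` type, the string value of `ERR_NO_FOREIGN_SCRIPT` and the `firstFailure` helper are assumptions made for illustration, and the `error` translation key is left out. It also assumes that a sentence fails a regex rule when the regex matches.

```typescript
// Assumed string value; the real constant lives in the Common Voice code base.
const ERR_NO_FOREIGN_SCRIPT = 'ERR_NO_FOREIGN_SCRIPT'

// Hypothetical, reduced shape of a regex rule.
type Rule = { type: 'regex'; regex: RegExp; errorType: string }

const rules: Rule[] = [
  {
    type: 'regex',
    regex: /[A-Z]{2,}|[A-Z]+\.*[A-Z]+/,
    errorType: ERR_NO_FOREIGN_SCRIPT,
  },
]

// Returns the error type of the first rule whose regex matches,
// or null if the sentence passes all rules.
function firstFailure(sentence: string): string | null {
  for (const rule of rules) {
    if (rule.regex.test(sentence)) return rule.errorType
  }
  return null
}
```

With these assumptions, `firstFailure('I work at NASA')` reports `ERR_NO_FOREIGN_SCRIPT`, so the “No foreign letters” entry would light up for what is really an abbreviation.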

More complex cases as described further down still would not be possible, except if you keep the OTHER category and map it against a very generic requirement string. But then it kinda loses its power.

2) Translators do not necessarily have technical knowledge

Not all translators using Pontoon have a technical background. Even if somebody can provide “knowledgeable” translations, I’m fairly sure that not all translators would be aware of the individual rule files and/or fully understand them. We should not require technical knowledge to contribute translations. The current versions of the rules files, if somebody finds the actual file on GitHub, have spelled-out error messages (which are possibly unused). This might help, but there are still a lot of hurdles to cross first.

3) More complex validation files

The default validation file is rather simple. We currently have 20 language validation files, the default one being one of them. Out of these 20 files, 10 have an ERR_OTHER error type which is not explicitly listed as a requirement. These vary widely in what they represent, and it’s almost impossible to know exactly what failed if you happen to submit a sentence that fails that validation. Granted, these are likely edge cases, but it can still be very frustrating to not have any feedback apart from a generic error when submitting.

There was a lot of effort put into the Thai validation file, resulting in 23 different rules in the validation file: common-voice/server/src/core/sentences/validation/languages/th.ts (main branch of common-voice/common-voice on GitHub).

4) (nit-picking) error field seems to be unused

I think it is? I have not checked everything. I feel we could leverage this in a better way.

Suggestion

It’s getting late and I might very well be missing things at this point. So take the following suggestion with a grain of salt and double check its validity. I’m not afraid of being told where I’m wrong or where I missed something. After all, this shall be a discussion to find a good solution together. There might be more straightforward or better solutions here.

In the old Sentence Collector the error property defined on the rule was directly returned to the frontend and shown. This was after submitting the sentence. Before submitting you had no idea what the actual validation rules are, so the current list of requirements is certainly an improvement already. I think those two approaches could be mixed.

Use `error` for the list

For the requirement list we could assemble the list from two sources:

Static content as already there, such as spelling and grammar, or appropriate citation
… and combine it with the list of error messages of the respective validation file for the language

Note that in the case of the new interface these would not even need to be translated, they could just be in their respective language, as the contribution interface will match that language anyway. Some of the validation files have English error messages, but that could be changed by specifically reaching out to the contributors and asking them for a correction.

For the default validation file we could still rely on the Pontoon translations as-is.
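A rough sketch of how the list could be assembled from the static entries and the per-language error messages. All names here (`STATIC_REQUIREMENTS`, `ValidationRule`, `buildRequirementList`) are hypothetical, and the sketch assumes that each rule carries a human-readable `error` message in the contribution language.

```typescript
// Static entries that apply to every language (wording taken from the current list).
const STATIC_REQUIREMENTS: string[] = [
  'No copyright restrictions (cc-0)',
  'Use correct grammar',
  'Use correct spelling and punctuation',
  'Include appropriate citation',
  'Ideally natural and conversational (it should be easy to read the sentence)',
]

// Hypothetical, reduced rule shape: only the message matters for the list.
type ValidationRule = { regex: RegExp; error: string }

// Combines the static entries with the messages of a language's rules,
// dropping duplicate messages while keeping the original order.
function buildRequirementList(rules: ValidationRule[]): string[] {
  const messages = rules.map((rule) => rule.error)
  const unique = messages.filter((message, index) => messages.indexOf(message) === index)
  return STATIC_REQUIREMENTS.concat(unique)
}
```

Deduplicating matters because several rules in one file may share the same message, and the list should show each requirement only once.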

Mapping error cases and entries

To be able to correctly identify and mark the unfulfilled requirement, we would need unique identifiers per validation rule. This could be a constant, a number or even just the index in the array, as long as the error messages as well as that unique identifier can be fetched from the backend and used to populate the list. Fetching this information from the backend likely adds little overhead to general traffic, however I do not have any numbers on this. Alternatively there are hacky ways to integrate this at build time, but we most likely want to avoid that.
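One way this could look, using the array index as the identifier. This is a sketch under assumptions: `IdentifiedRule`, `RequirementEntry`, `toRequirementEntries` and `failedRuleIds` are made-up names, and a sentence is assumed to fail a rule when the rule's regex matches.

```typescript
// Hypothetical rule shape; `error` is the message shown in the requirement list.
// The regexes are assumed to have no `g` flag, so `test` keeps no state between calls.
type IdentifiedRule = { regex: RegExp; error: string }

// What the backend could send to the frontend to populate the list.
type RequirementEntry = { id: number; text: string }

function toRequirementEntries(rules: IdentifiedRule[]): RequirementEntry[] {
  return rules.map((rule, index) => ({ id: index, text: rule.error }))
}

// Returns the ids of all rules the sentence violates, so the frontend
// can mark exactly those entries in red.
function failedRuleIds(rules: IdentifiedRule[], sentence: string): number[] {
  const ids: number[] = []
  rules.forEach((rule, index) => {
    if (rule.regex.test(sentence)) ids.push(index)
  })
  return ids
}
```

The index only works as an identifier as long as the rule order is the same when the list is fetched and when the sentence is validated; an explicit constant per rule would be more robust if files change between deployments.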

Result

For Catalan this would result in the following list (I’m very sorry for not translating the first few items):

No copyright restrictions (cc-0)
Use correct grammar
Use correct spelling and punctuation
Include appropriate citation
Ideally natural and conversational (it should be easy to read the sentence)
El nombre de paraules ha de ser entre 1 i 14 (inclòs)
La frase no pot contenir nombres
La frase no pot contenir signes de puntuació al mig
La frase no pot contenir simbols o multiples espais o exclamacions
La frase no pot contenir abreviacions o acrònims

bozden · May 11, 2023, 10:47pm

Great review @mkohler.

I think part of this problem is caused by the limitations of Fluent/Pontoon, which we already discussed in the past about the dinosaur examples. They were based on English language rules and people have to adapt/localize them with the rules of their language. But there were N of them, and you have to provide exactly N adaptations, no more, no less.

One idea I can think of is to divide the rules into two groups, “common rules” and “language-specific rules”, and let the second one have (say) 10 variables. The code checks whether each variable is empty and only outputs it if it is not.
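A minimal sketch of that idea, with made-up names throughout (`COMMON_RULES`, `MAX_LANGUAGE_RULES`, `visibleRules`). It assumes the language-specific slots arrive as strings from the localization, where an unused slot is empty.

```typescript
// Rules shown for every language (examples taken from the current list).
const COMMON_RULES: string[] = ['Use correct grammar', 'Include appropriate citation']

// Assumed upper bound on language-specific slots ("say 10").
const MAX_LANGUAGE_RULES = 10

// Keeps the common rules and appends only the non-empty language-specific slots.
function visibleRules(languageSlots: string[]): string[] {
  const filled = languageSlots
    .slice(0, MAX_LANGUAGE_RULES)
    .map((slot) => slot.trim())
    .filter((slot) => slot.length > 0)
  return COMMON_RULES.concat(filled)
}
```

A language with three specific rules would fill three slots and leave the other seven empty, and the interface would show only those three after the common ones.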

kushvisk · May 12, 2023, 9:35pm

Can multiple sentences be added at once?

kathyreid · May 13, 2023, 2:33am

I just want to give my sincere thanks to the team, this new collector is much better for data provenance and sourcing. Big ups

Essaidib2 · May 13, 2023, 8:52pm

Hi,
The new sentence collector is not reflected in the recently launched Common Voice site for Tamazight (zgh).

gina · May 16, 2023, 8:05am

Thank you, we will check the issue and get back to you with a solution.

gina · May 16, 2023, 8:12am

As of right now we support only one sentence per ‘entry’. Submitting multiple sentences separated by newlines, as was possible in the old Sentence Collector, will be implemented in the next iteration.

jesslynnrose · May 17, 2023, 12:48pm

Not right now (it’s one sentence per submission) but we have plans that should make submitting multiple sentences at once easier in the near future and we have documentation for Bulk Sentence Uploads if you want to send us a lot of sentences at once!

ok_alp · July 28, 2023, 11:21am

Hi Jess,

Is there any date on the horizon for bulk sentence uploads?

I can manage the bulk upload through Github also, but I was wondering if they end up directly in the sentence set or they go through reviewing. I’d prefer they went through review as my corpus is rather large and does contain some errors that I cannot pick out myself manually.

jesslynnrose · August 2, 2023, 9:10am

Hello! For now, we don’t have a landing date for the updated bulk sentence process. Github bulk sentence uploads will go through the review process, if you don’t mind sending them in this way!

Freso · August 12, 2023, 11:10am

This link seems to be dead. Looks like the file name spelling was fixed (by you), so the new link is docs/submitting-bulk-sentences.md on the main branch of github.com/common-voice/common-voice:

All the sentences read by our contributors are sourced from copyright free sources or permissioned sources, either through [automated sourcing](https://github.com/common-voice/cv-sentence-extractor) by agreement from sources like Wikipedia or contributed by our language communities.

If you want to add a single sentence, or a series of less than 1000 sentences, you can do so via the [Sentence Collection](https://commonvoice.mozilla.org/write) page on the [Common Voice website](https://commonvoice.mozilla.org). To contribute a larger number of sentences (1000+) at once, you can use the Bulk Sentence upload option. _Remember, only files with more than 1000 sentences will be manually processed due to our small team size._

Remember that for both single sentence and bulk sentence submissions, sentences must:
- Be in the public domain, with a CC0 license
- Be short, readable and take about 10-15 seconds to read
- Avoid including numbers or special characters

## Formatting Your Bulk Sentences
To upload bulk sentences, you'll need to have created a TSV file with your sentences. 

Please format your bulk sentences into a [.TSV](https://en.wikipedia.org/wiki/Tab-separated_values) file with five columns, containing the following columns from left to right:
- Your sentence
- The source for your sentence
- Additional information about why this source is eligible for inclusion in a CC0 dataset
- A blank column for our team to use in the quality control process
- An optional column showing the domain for sentences, as described in [this blog post](https://foundation.mozilla.org/blog/domain-datasets-common-voice/)
- An optional column for the language variant of sentence(s), where applicable

The more information you are able to provide in the Source column, the easier it will be to get your bulk sentence submission validated.
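For illustration, here is a sketch that turns sentence records into TSV lines with the columns listed above. The `BulkSentence` type and the `cleanField` and `toTsvLine` helpers are hypothetical. The excerpt mentions five columns but lists six fields, so this sketch simply emits all six in the listed order, and it writes no header row because the excerpt does not say whether one is expected.

```typescript
type BulkSentence = {
  sentence: string
  source: string
  rationale: string // why the source is eligible for a CC0 dataset
  domain?: string // optional
  variant?: string // optional
}

// Tabs and line breaks inside a field would break the TSV layout,
// so collapse them to single spaces.
function cleanField(value: string): string {
  return value.replace(/[\t\r\n]+/g, ' ').trim()
}

function toTsvLine(entry: BulkSentence): string {
  const columns = [
    entry.sentence,
    entry.source,
    entry.rationale,
    '', // blank column reserved for the quality control process
    entry.domain || '',
    entry.variant || '',
  ]
  return columns.map(cleanField).join('\t')
}
```

Joining the lines produced by `toTsvLine` with newlines gives the file content; whether the team expects exactly this layout should be checked against the full document.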


Thanks for the pointer!
