Is there still a way to add more than 1000 sentences at once?

The documentation is inconsistent on this point:

" If you want to add a single sentence, or a series of less than 1000 sentences, you can do so via the Sentence Collection page on the Common Voice website. To contribute a larger number of sentences (1000+) at once, you can use the Bulk Sentence upload option. Remember, only files with more than 1000 sentences will be manually processed due to our small team size.",

I don’t see this option now. Does that mean bulk sentence upload is no longer supported? Or does it mean that files should be sent by email and will only be checked if they contain more than 1000 sentences?

Hey @Libra, thank you for the question.

That option is still there, but it is not public anymore. It is mainly a manual process involving the whole CV team, external emails for verification (mostly hard to come by), and in many cases also Mozilla Legal, and it takes up to a month.

After adding that, CV also added small-batch, where you can submit up to 1000 sentences per variant+source by yourself. It will require two votes, yes, but this will also increase the text-corpus quality. We had some problems in the past because of bulk additions of not-so-good-quality sentences.

For this reason, and because we don’t have the bandwidth to deal with bulk additions, we had to hide it.

If you insist on using that facility (e.g. you have >100k CC0 sentences), you can send a request to commonvoice@mozilla.com

PS: We will fix the doc, thanks for the ping…

Hi, thanks for the clarification!

If you insist on using that facility (e.g. you have >100k CC0 sentences), you can send a request to commonvoice@mozilla.com

I remembered that I had definitely seen this option, but I couldn’t find it now, so I wanted to check what exactly had happened.

Unfortunately, I don’t have that amount of data. I’m a member of the Tatoeba project, some of whose contributors release their sentences under the CC0 license, but it is only about 15-20k sentences, and they need to be cleaned up first, because many of them currently include numbers or non-native alphabet characters, are longer than 14 words, etc. It should not be hard to do that even with basic SQL queries, but I don’t want to take up your time with it. Maybe I will do it myself later and split the results into TSV tables with fewer than 1000 sentences each.
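A minimal sketch of that cleanup in Python, in case it is useful to others. The three checks (digits, characters outside the language's alphabet, more than 14 words) come from the criteria mentioned above; the function names, the `ALLOWED_EN` character set and the batch size of 999 are only assumptions for illustration, and the allowed characters would have to be adapted per language.

```python
import string

MAX_WORDS = 14      # word limit mentioned in this thread
BATCH_SIZE = 999    # assumed: stay below the 1000-sentence limit

# Hypothetical allowed character set for English; adapt it to your language.
ALLOWED_EN = set(string.ascii_letters + " .,!?'-")


def is_clean(sentence, allowed_chars=ALLOWED_EN):
    """Return True if the sentence passes the three basic checks."""
    if any(ch.isdigit() for ch in sentence):
        return False  # contains numbers
    if len(sentence.split()) > MAX_WORDS:
        return False  # too long
    # every character must belong to the allowed alphabet/punctuation
    return all(ch in allowed_chars for ch in sentence)


def split_batches(sentences, size=BATCH_SIZE):
    """Split a list of sentences into consecutive batches of at most `size`."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


if __name__ == "__main__":
    raw = ["Hello world.", "I have 2 cats.", "Привет мир."]
    cleaned = [s for s in raw if is_clean(s)]
    for number, batch in enumerate(split_batches(cleaned), start=1):
        print(f"batch {number}: {len(batch)} sentence(s)")
```

This only covers the mechanical checks; it does not replace reading the sentences before submitting them.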

After adding that, CV also added small-batch, where you can submit up to 1000 sentences per variant+source by yourself. It will require two votes, yes, but this will also increase the text-corpus quality. We had some problems in the past because of bulk additions of not-so-good-quality sentences.

By the way, how does this small-batch submission work? Are all sentences added to the DB for voting as if I had added them one by one (so every sentence in the list must be approved by 2 votes)? Or do I add a list of 1000 sentences, some percentage of them (5%, for instance) is randomly sampled for checking, and if those are approved without any problems, all the others are approved and added automatically?

Keep them somewhere :slight_smile: We will be increasing those limits…

A) One line per sentence, up to 999 :slight_smile:
Select a “Domain” (if specific), or select “general” or leave it empty, then write the “source”…
If the language has variants and the sentences are variant-specific, select the variant.
Send them.
Wait 2 minutes (rate-limited).
Send the second batch.

15-20k sentences

20 postings => 40 min

B) Then 2 people (one can be you) should vote on them.
Rate limited: 2 sec per vote (30 max/min)

C) As soon as a sentence gets 2 YES votes, it becomes available for recording at most 1 hour later (there is a cron job for that) - does not wait until all validated.

D) For recording, a random 25 of the least recorded sentences (these will have been recorded 0 times) will be fed to the Speak page…
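To put rough numbers on steps A and B, here is a back-of-the-envelope sketch in Python. It only uses the figures quoted in this thread (up to 999 sentences per posting, a 2-minute wait per posting, 2 votes per sentence, 2 seconds per vote); the function name and the assumption that each posting costs one 2-minute wait are illustrative, not a description of how the site actually measures things.

```python
import math


def estimate_effort(n_sentences, batch_size=999, wait_min=2,
                    votes_needed=2, sec_per_vote=2):
    """Estimate postings, upload wait time and total voting time.

    All defaults are the figures mentioned in this thread.
    """
    postings = math.ceil(n_sentences / batch_size)
    upload_minutes = postings * wait_min            # one wait per posting (assumed)
    voting_hours = n_sentences * votes_needed * sec_per_vote / 3600
    return postings, upload_minutes, voting_hours


if __name__ == "__main__":
    postings, upload_minutes, voting_hours = estimate_effort(20_000)
    print(f"{postings} postings, about {upload_minutes} min of waiting, "
          f"about {voting_hours:.1f} h of voting in total")
```

For 20k sentences this gives 21 postings and about 42 minutes of waiting, close to the rough figure above, but the voting is the expensive part: about 22 hours in total at the maximum rate, shared between the two voters.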

Does this help?

This is something I know already :slight_smile:

B) Then 2 people (one can be you) should vote on them.
Rate limited: 2 sec per vote (30 max/min)

What I mean is: does every sentence in the list have to be approved separately, or is there some percentage that has to be approved, after which the whole list is approved automatically?

All other info is useful to know. Thanks!

Point C above answers that, I guess?

One note: what I said does not apply to a new language. A new language should first complete its designated Band A/B/C requirement (5000/2000/750).

Sorry, I missed this part: “does not wait until all validated”