New Language Workflow: 5K Sentences Requirement?

TLDR: How could we change the 5000 sentence collection requirement to be more inclusive of a language community's needs (e.g. low-resourced languages, or languages with few speakers)?

Hey everyone,

At the contribute-athon sessions, we discussed some of the ideas the Common Voice team has for the New Language Workflow for Common Voice.

I want to ensure as many people as possible can be involved in this discussion, so I have created this topic.

The language workflow is the process in which a language joins the Common Voice website for voice data collection. See this comment to understand how it works.

We want to improve the language workflow. This includes, but is not limited to, centralising documentation by bringing the Community Playbook onto the Common Voice website and evolving it there.

To help us, we would like to hear the community's thoughts on two questions:

  1. How could we change the 5000 sentence collection requirement to be more inclusive of a language community's needs (e.g. low-resourced languages, or languages with few speakers)?
  2. What documentation did you wish you had, or do you still need, to support your language's journey to being launched on Common Voice for voice data contributions?

We look forward to hearing your thoughts!

The current Language Workflow

  1. A person (or group) requests a language via the website, GitHub or community platforms.

  2. The person begins to localise the platform via Pontoon.

  • The Common Voice platform text is 90%+ localised via Pontoon by the person and contributor translators

  • Roughly 5,000 public domain (CC0) sentences are collected via the Sentence Extractor and the Sentence Collector

  3. The language is live
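
The two launch conditions in the list above can be read as a simple check. Here is a minimal sketch in Python, assuming hypothetical names (`LanguageStatus`, `is_ready_to_launch`) and treating "90%+" and "roughly 5,000" as hard thresholds, which the real process may not do:

```python
from dataclasses import dataclass

# Assumed thresholds, taken from the workflow description above.
MIN_LOCALISED_FRACTION = 0.90
MIN_SENTENCES = 5000


@dataclass
class LanguageStatus:
    """Hypothetical snapshot of a language's progress (not a Common Voice API)."""
    localised_fraction: float  # share of platform text localised in Pontoon, 0.0 to 1.0
    cc0_sentences: int         # number of CC0 sentences collected so far


def is_ready_to_launch(status: LanguageStatus) -> bool:
    """Return True when both launch conditions are met."""
    return (status.localised_fraction >= MIN_LOCALISED_FRACTION
            and status.cc0_sentences >= MIN_SENTENCES)
```

For example, a language with 95% of the text localised but only 1,200 sentences would not yet be live under this reading.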

I am answering from the perspective of a long-time contributor to CV (so I am not involved in newly started languages at all).

It took me a long time to find out how to contribute new sentences, and with which tool (the Common Voice Sentence Collector).
I only found out after crawling through Discourse.

I would not change the 5000 sentence requirement (it has worked for many languages so far), but I would optimize the way to achieve it.

Right now there are 2 sections on the CV main page: speak and listen. A third section (for example: new sentences) could be added to show contributors all the ways of contributing to the CV project right from the start.

The logic behind this: no new sentences means no new recordings, and the language section comes to a stop.
Or a language cannot start recording at all because not enough sentences have been collected so far.

The new third section would explain what the Common Voice Sentence Collector is, link to it, and state that CC0 sentences can be contributed for both existing and newly started languages.

It would also explain the possibility of uploading bigger text files (how to do it and which criteria must be met).


I want to repeat the suggestion I made during the meeting(s), for the general public:

Some major troubles that new and under-resourced languages are facing:

  1. Finding 5000 CC-0 sentences
  2. Finding the first group of volunteers to record these.
  3. Translating the UI through Pontoon (90%)

I learned from the meeting that the “5000” sentences requirement is based on an AI/ML-related technical minimum, namely, you would need that minimum to be able to build an initial model.

I don’t think that would apply here…

I suggest that the requirement be relaxed for languages with a low number of native speakers (based on census data, for example). Few people speaking the language means few sentence resources, few volunteers, etc.

I suggest setting a speaker threshold for the full 5000, such as 1,000,000 native speakers (just an example), and requiring half of it (2500) if the number of speakers is 500,000, for example. This is very important, especially for dying languages IMHO…
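
To make the arithmetic concrete, here is a minimal sketch in Python. It assumes the requirement scales linearly with the number of native speakers, which is only one possible reading of the two example figures above; the function name `required_sentences` and its parameters are hypothetical, and the default numbers are just the examples from this post.

```python
def required_sentences(native_speakers: int,
                       full_requirement: int = 5000,
                       speaker_threshold: int = 1_000_000) -> int:
    """Scale the sentence requirement with the number of native speakers.

    Assumption: linear scaling below the threshold, full requirement at or
    above it. The defaults are the example figures from the post, not policy.
    """
    if native_speakers <= 0:
        raise ValueError("native_speakers must be positive")
    if native_speakers >= speaker_threshold:
        return full_requirement
    # Integer ceiling of full_requirement * native_speakers / speaker_threshold,
    # so very small languages still need at least one sentence.
    return -(-full_requirement * native_speakers // speaker_threshold)


print(required_sentences(1_000_000))  # 5000
print(required_sentences(500_000))    # 2500
```

A real rule would probably also need a lower bound, so that a language with very few speakers does not end up with a requirement too small to be useful; that choice is left open here.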

When you adopt such a process, CV will be the anchor point for new volunteers… These volunteers will keep working on completing the UI, sentences and recordings. Otherwise the 5000 sentences will be done by very few people, making the first dataset very biased anyway.


This IS of utmost importance, as I have mentioned everywhere… Like this one:


Yes, and many contributors do not know that they can contribute (small amounts of) their own sentences as well as larger text files, and so they… leave.
What do you think of the idea of implementing this on the CV main page in a new third section? (Creating your own sentences, where every sentence counts for existing and newly started languages, plus how to submit larger files.)

For preserving languages, lowering the requirements could be an option, with the danger of running out of sentences. So CV must find a way to show this to all contributors.

I have also filed this on GitHub to move the process forward.

All credit for this one goes to @bozden!

  1. Finding 5000 CC-0 sentences
  • In the worst case there is no material online, or it is behind paywalls.
  • Rare printed material is copyrighted, and you have to pay authors to switch to CC0.
  • Knowledge of this language/dialect is passed on by word of mouth.
  2. Finding the first group of volunteers to record these.
  • Are there any institutions (at state level) that preserve this language/dialect? Does a national archive exist for it?
  • Are there private interest groups for this language/dialect, online and offline? You do not reach those with social media campaigns when they only have GSM phones.
  • Are there offline courses for it, such as evening schools for interested people?

For those who are short on sentences, there are two methods we have started using or will be using: (1) volunteers write down utterances from their everyday speech; (2) we will open a CC-0 chatroom and invite everybody to write something or converse there…

You can easily reach 5000 using methods like these… It is not mandatory to have CC-0 literary works. After that point, you can research more, once you have an anchor here.


Creating (own) sentences is the key function :sunglasses:

A good one that we can add to the best practices! We took the same approach and opened a Chinese CC0 chatroom on our local civic hacking community's Slack over the past 2 years, and we now have tens of thousands of sentences available for use.

This restriction is not necessary.

In the very beginning of Common Voice (likely in the first year), we did not enforce a “one sentence recorded only once” rule. That rule was a request from DeepSpeech, so that its model would only consider each sentence once.

We can definitely ease that restriction, since Common Voice no longer serves a single specific project.


One more way to get CC0 sentences:
Take 5000 verified CC0 English sentences (or whatever amount is needed) and translate them via Pontoon into the newly started language.

Automated translation results are mostly not good (classic example: user manuals).

In my opinion this is not a good idea. Translations will suffer from translationese and not be idiomatic. In this case, it is better to just come up with sentences from scratch. It will be around the same amount of work as translating, but the sentences will be more idiomatic.

