Hello Common Voice Community,
Sentence collection is a core component of enabling languages on Common Voice for voice contributions. Thank you all for your feedback on the sentence collection requirement.
Based on your input, and working closely with a linguistic advisor, Francis, we have designed a more equitable system. We are proposing the following system for your review:
TL;DR
- Communities will be able to choose from three bands, based on the size of population, the resources they have at their disposal, and the vitality of their language.
- Depending on which band they feel best describes their language, they could need to collect fewer than 5,000 sentences in order to get started.
- Please share your reviews by 3rd March AoE (Anywhere on Earth) via Discourse or by email to commonvoice@mozilla.com
New Sentence Collection Bands
Please click on the arrows to view the details and why a language community might be in this category
Band A: 750 sentences
Why might a community be in this category?
- Speaker population: Small (~10 speakers - 1M speakers)
- Resourcing self assessment: Low
  - Necessary skills (code, literacy)
  - Internet penetration
  - Public domain text
- Language vitality level: Low (Graded Intergenerational Disruption Scale 7-8)
Examples: Basaa, Hakha Chin, Votic, Breton, Chuvash
Band B: 2,000 sentences
Why might a community be in this category?
- Speaker population: Medium-High (~1M speakers - 10M speakers)
- Resourcing self assessment: Medium
  - Necessary skills (code, literacy)
  - Internet penetration
  - Public domain text
- Language vitality level: Moderate (Graded Intergenerational Disruption Scale 4-6)
Examples: Guarani, Kiswahili, Kurdish, Mongolian, Hausa, Azerbaijani
Band C: 5,000 sentences (7 hours of no-duplicate data)
Why might a community be in this category?
- Speaker population: Large (over 10M speakers)
- Resourcing self assessment: High
  - Necessary skills (code, literacy)
  - Internet penetration
  - Public domain text
- Language vitality level: High (Graded Intergenerational Disruption Scale 1-4)
Examples: English, Arabic, Chinese, Belarusian, Estonian, Hindi
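For anyone who wants a rough sense of scale, here is a minimal sketch of the arithmetic behind the band sizes. The sentence counts come from the bands above. The assumption (not stated in the proposal) is that the "5,000 sentences = 7 hours" figure for Band C implies an average clip length that also holds for Bands A and B, so the hours shown for those two bands are an extrapolation only.

```python
# Sentence targets per band, as listed in this post.
BAND_SENTENCES = {"A": 750, "B": 2000, "C": 5000}

# Assumption: Band C's "5,000 sentences = 7 hours" implies an average clip
# length, and the other bands scale at the same rate. This is illustrative,
# not an official Common Voice figure for Bands A and B.
SECONDS_PER_SENTENCE = 7 * 3600 / BAND_SENTENCES["C"]


def estimated_hours(band: str) -> float:
    """Rough hours of non-duplicate audio if every sentence is read once."""
    return BAND_SENTENCES[band] * SECONDS_PER_SENTENCE / 3600


for band, count in BAND_SENTENCES.items():
    print(f"Band {band}: {count} sentences, ~{estimated_hours(band):.2f} hours")
```

Under that assumption, each sentence accounts for about five seconds of audio, so the smaller bands correspond to proportionally fewer hours of unique recordings.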
What’s Next?
New communities will be able to share this information via the new language request GitHub issue. For languages that are already in progress, we will reach out over the next two weeks.
Our goal is that communities can get started more quickly and build momentum. But remember, sentence collection must be an ongoing process in order to limit repetition, which at scale may negatively affect the health of the dataset for model training purposes. We will be working on community support mechanisms to notify communities when they may be running out of sentences.
We look forward to reading your reviews. If you would prefer to email, please message commonvoice@mozilla.com.
If you are happy to localise this communication, please message me via Discourse or Element.
Thank you on behalf of the Common Voice Team