[Share your views] Nuancing sentence collection requirements - New Sentence Collection Bands

heyhillary · February 15, 2022, 12:23pm

Hello Common Voice Community,

Sentence collection is a core component of enabling languages on Common Voice for voice contributions. We would like to thank you all for your feedback about the sentence collection requirement.

Based on your input, we have been working closely with a linguistic advisor, Francis, to design a more equitable system. We are proposing the following system for your review:

TL;DR

Communities will be able to choose from three bands, based on the size of population, the resources they have at their disposal, and the vitality of their language.

Depending on which band they feel best describes their language, they could need to collect fewer than 5000 sentences in order to get started.

By 3rd March AOE, please share your reviews via Discourse or by email to commonvoice@mozilla.com

New Sentence Collection Bands

Band A: 750 sentences

Why might a community be in this category?

Speaker population: Small (~10 speakers - 1M speakers)
Resourcing self assessment: Low
- Necessary skills (code, literacy)
- Internet penetration
- Public domain text
Language vitality level: Low (Graded Intergenerational Disruption Scale 7-8)

Examples: Basaa, Hakha Chin, Votic, Breton, Chuvash

Band B: 2,000 sentences

Why might a community be in this category?

Speaker population: Medium-High (~1M speakers - 10M speakers)
Resourcing self assessment: Medium
- Necessary skills (code, literacy)
- Internet penetration
- Public domain text
Language vitality level: Moderate (Graded Intergenerational Disruption Scale 4-6)

Examples: Guarani, Kiswahili, Kurdish, Mongolian, Hausa, Azerbaijani

Band C: 5,000 sentences (7 hours of no-duplicate data)

Why might a community be in this category?

Speaker population: Large (over 10M speakers)
Resourcing self assessment: High
- Necessary skills (code, literacy)
- Internet penetration
- Public domain text
Language vitality level: High (Graded Intergenerational Disruption Scale 1-4)

Examples: English, Arabic, Chinese, Belarusian, Estonian, Hindi
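
As a rough illustration only, the three bands can be written down as a small lookup. This sketch considers speaker population alone, whereas the proposal asks communities to self-assess on resourcing and language vitality as well. The names `Band` and `suggest_band` are hypothetical, and the handling of the exact 1M and 10M boundaries is an assumption, since the ranges above are approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    name: str
    min_sentences: int  # sentences needed to get started


# Sentence counts taken from the band descriptions above.
BAND_A = Band("A", 750)
BAND_B = Band("B", 2_000)
BAND_C = Band("C", 5_000)


def suggest_band(speakers: int) -> Band:
    """Suggest a starting band from speaker population only.

    Assumption: exactly 1M speakers falls in Band B and exactly 10M
    stays in Band B ("over 10M" is Band C). Resourcing and vitality
    are not modelled here; a community may choose a different band.
    """
    if speakers < 1_000_000:
        return BAND_A
    if speakers <= 10_000_000:
        return BAND_B
    return BAND_C


if __name__ == "__main__":
    for population in (500, 5_000_000, 50_000_000):
        band = suggest_band(population)
        print(population, band.name, band.min_sentences)
```

A community with low resourcing or low vitality might reasonably pick a lower band than this population-only sketch suggests.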

What’s Next?

New communities will be able to share this information via the new language request GitHub issue. For languages which are already in progress, we will reach out during the next two weeks.

Our goal is that communities can get started more quickly and build momentum. But remember, sentence collection must be an ongoing process in order to limit repetition, which at scale may negatively affect the health of the dataset for model training purposes. We will be working on community support mechanisms to notify communities when they may be running out of sentences.

We look forward to reading your reviews. If you would prefer to email, please message commonvoice@mozilla.com.

If you are happy to localise this communication, please message me via Discourse or Element.

Thank you on behalf of the Common Voice Team

daniel.abzakh · February 15, 2022, 7:52pm

Hello,

I think this can be very useful for the community support desk, in three aspects:

[Information gathering stage] This information might help understand the strengths and weaknesses of each community, in order to build a strategy for support.
[Knowledge transfer stage] Resource allocation. This can help identify resources and move them to where they are needed (i.e. allocate resources from high-band to low-band languages).
[Activation stage] An indicator. Ultimately, the goal is to move languages from low band to high band.

heyhillary · March 9, 2022, 2:42pm

Open Knowledge Foundation by Prasanta - made a datasheet on Common Voice from the perspective of Santali: https://github.com/ofdn/Before-AI/blob/main/datasheet/Santali%20Common%20Voice.md

This includes an analysis of sentences.
