[Share your views] Nuancing sentence collection requirements - New Sentence Collection Bands

Hello Common Voice Community,

Sentence collection is a core component of enabling languages on Common Voice for voice contributions. We would like to thank you all for your feedback about the sentence collection requirement.

Based on your input, and by working closely with a linguistic advisor, Francis, we have designed a more equitable system. We are proposing the following system for your review:

TL;DR

  • Communities will be able to choose from three bands, based on the size of population, the resources they have at their disposal, and the vitality of their language.
  • Depending on which band they feel best describes their language, they could need to collect fewer than 5000 sentences in order to get started.
  • By 3rd March AOE, please share your feedback via Discourse or by email to commonvoice@mozilla.com

New Sentence Collection Bands

Please click on the arrows to view the details and why a language community might be in this category


Band A: 750 sentences

Why might a community be in this category?

  • Speaker population: Small (~10 speakers - 1M speakers)
  • Resourcing self assessment: Low
    • Necessary skills (code, literacy)
    • Internet penetration
    • Public domain text
  • Language vitality level: Low (Graded Intergenerational Disruption Scale 7-8)

Examples: Basaa, Hakha Chin, Votic, Breton, Chuvash


Band B: 2,000 sentences

Why might a community be in this category?

  • Speaker population: Medium-High (~1M speakers - 10M speakers)
  • Resourcing self assessment: Medium
    • Necessary skills (code, literacy)
    • Internet penetration
    • Public domain text
  • Language vitality level: Moderate (Graded Intergenerational Disruption Scale 4-6)

Examples: Guarani, Kiswahili, Kurdish, Mongolian, Hausa, Azerbaijani


Band C: 5,000 sentences (7 hours of no-duplicate data)

Why might a community be in this category?

  • Speaker population: Large (over 10M speakers)
  • Resourcing self assessment: High
    • Necessary skills (code, literacy)
    • Internet penetration
    • Public domain text
  • Language vitality level: High (Graded Intergenerational Disruption Scale 1-4)

Examples: English, Arabic, Chinese, Belarusian, Estonian, Hindi
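
For anyone who wants to reason about the bands programmatically, here is a minimal sketch in Python. It is not part of the Common Voice platform: the names `Band`, `BANDS`, `band_by_population` and `sentences_still_needed` are hypothetical, and it looks only at speaker population as a first guess. Under the proposal, communities choose their own band, also weighing resourcing and language vitality, so the result is a starting point and not a decision. How the exact boundaries (1M and 10M speakers) are treated is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    name: str
    min_sentences: int   # sentences needed to get started, per the proposal
    max_speakers: float  # assumed inclusive upper bound on speaker population


# Thresholds taken from the band descriptions above.
BANDS = [
    Band("A", 750, 1_000_000),
    Band("B", 2_000, 10_000_000),
    Band("C", 5_000, float("inf")),
]


def band_by_population(speakers: int) -> Band:
    """Suggest a band from speaker population alone (a first guess only)."""
    if speakers < 0:
        raise ValueError("speaker population cannot be negative")
    for band in BANDS:
        if speakers <= band.max_speakers:
            return band
    raise AssertionError("unreachable: the last band has no upper bound")


def sentences_still_needed(band: Band, collected: int) -> int:
    """How many more sentences are needed to reach the band's minimum."""
    return max(0, band.min_sentences - collected)


if __name__ == "__main__":
    suggestion = band_by_population(3_500_000)
    print(suggestion.name, sentences_still_needed(suggestion, collected=600))
```

A community with a large speaker population but low resourcing or low vitality might still reasonably pick a lower band than this population-only sketch suggests.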


What’s Next?

New communities will be able to share this information via the new language request GitHub issue. For languages which are already in progress, we will reach out during the next two weeks.

Our goal is that communities can get started more quickly and build momentum. But remember, sentence collection must be an ongoing process in order to limit repetition, which at scale may negatively affect the health of the dataset for model training purposes. We will be working on community support mechanisms to notify communities when they may be running out of sentences.

We look forward to reading your feedback. If you would prefer to email, please message commonvoice@mozilla.com.

If you are happy to localise this communication, please message me via Discourse or Element.

Thank you on behalf of the Common Voice Team



Hello,

I think this can be very useful for the community support desk, in three aspects:

  1. [Information gathering stage] This information might help understand the strengths and weaknesses of each community, in order to build a strategy for support.

  2. [Knowledge transfer stage] Resource allocation. This can help identify resources and move them to where they are needed (i.e. allocate resources from high-band to low-band languages).

  3. [Activation stage] An indicator. Ultimately, the goal is to move languages from low band to high band.

Prasanta (Open Knowledge Foundation) made a datasheet on Common Voice from the perspective of Santali: https://github.com/ofdn/Before-AI/blob/main/datasheet/Santali%20Common%20Voice.md

This includes an analysis of sentences.
