Streamlining Localization and Reducing Barriers for Common Voice Communities

Hello everyone :slightly_smiling_face: :grinning:

We would like to gather feedback on a situation that we have been monitoring for some time.

When Common Voice started, industry benchmarks suggested that around 5000 hours of ASR training data might be required for training a robust STT model that could be deployed in a product like a voice assistant. Over the years, more foundational multilingual models, and more methodologies for fine-tuning models to new linguistic contexts, have emerged, and many communities are deciding not to train from scratch, but to build on these technologies.

In these contexts, communities may only be aiming to generate small datasets, for example 50 hours. We are also increasingly taking language requests from Band A languages (small speaker populations, low resource contexts etc). In this context, the current localisation burden of getting started on Common Voice (~1500 strings on Pontoon!) may seem particularly oversized.

We want to better support a range of different data collection modalities. In light of this, we are considering overhauling our localisation approach to enabling a new language on Common Voice.

In particular, we are considering reducing the localisation requirements for go-live to only the text on the core contribution UIs (Speak, Listen, Write), the Download modal and the legal documents (privacy, terms). This would not prevent communities from localising further, but it would mean that communities could start collecting data sooner.

We welcome your input on this proposal.

Looking forward to hearing from you.

CV Team


This is awesome. But as someone who works with a few minority languages, I think the biggest obstacle is not only the website localization, but also the steps prior to it, especially #1 the unclear requirements for requesting a new language, and #2 the lack of guidance on the text/written part of the requested language.

For #1, as an example, I have been thinking about adding a new language called Teochew. However, there is currently no ISO 639-3 code for this language. There is only a proposal from 2021 to split the existing nan code into a few sub-languages, which includes a new code tws for Teochew. The language request page of Common Voice right now doesn’t say whether a new language must have an ISO code, or whether a proposed code is also acceptable. It would be nice if the decision process were more transparent, so that the community knows in advance whether a request is likely to be accepted.

For #2, most minority languages are largely unwritten, i.e. they don’t have a writing tradition or a writing system, and most people don’t know how to read or write their mother tongue (which is also exactly why they are low-resource). This is actually the biggest blocker for most Pontoon localization work. Because people simply don’t know how to write the language, let alone doing translation. They first have to find a way to write down their “spoken tongues” properly, or at least consistently.

This also applies to languages with multiple writing systems. Should each writing system be requested as a separate language, or should one writing system be picked for the language? As an example, the current Taiwanese Southern Min sentences combine both Chinese characters and Tai-lo (romanized spellings). This duplicates content in the ASR text data, which isn’t normal practice and necessitates additional pre-processing work in downstream ASR training. Is this acceptable for the Common Voice dataset? More guidelines should be provided in this process.
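To give a rough idea of the extra pre-processing I mean, here is a minimal sketch in Python. It assumes a sentence simply mixes Han characters with a romanized rendering on the same line; the sample sentence, the function name `split_scripts` and the Unicode ranges are my own illustrative assumptions, not the actual format or tooling of the Common Voice dataset.

```python
import re

# Assumption: CJK Unified Ideographs plus Extension A cover the Han characters
# of interest. Rarer characters in other extension blocks are not matched.
HAN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def split_scripts(sentence: str) -> tuple[str, str]:
    """Return (han_text, romanized_text) for a sentence mixing both scripts."""
    # Collect every Han character, in order of appearance.
    han = "".join(HAN.findall(sentence))
    # Replace Han characters with spaces, then collapse runs of whitespace.
    romanized = " ".join(HAN.sub(" ", sentence).split())
    return han, romanized


# Hypothetical mixed-script sentence, for illustration only.
han, romanized = split_scripts("你好 lí-hó")
print(han)
print(romanized)
```

Note that anything that is not a Han character (including punctuation) ends up in the romanized part, so real data would likely need further cleaning. Every downstream user currently has to make choices like these on their own, which is why a guideline would help.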

I think this is where Common Voice can step in. I am not saying that we should ask people to start prescribing an orthography or pick a writing system for a language. But I think Common Voice could provide a few existing locales as successful examples, and start closer collaboration with the locale administrator / language community to provide some sort of guidance to inform people how to write their languages / choose the writing system when a new language is proposed.


Because people simply don’t know how to write the language, let alone doing translation.

Just a minor addition to these very nice points: in the case of endangered languages (<10,000 speakers), most speakers are very old and illiterate, and naturally don’t use computers or smartphones either. We are trying to find ways to reach (record) them via young volunteers, while also trying to play along with the CV rules/system (e.g. pre-reading the sentences to them so they can repeat them).

Machine translation will not help these languages either, as they never get modeled.

A second point: most of the phrases in Pontoon could be moved into multilingual documentation, which would reduce the total number of strings considerably.


Just came across another good example.

I believe this issue will become more common as we expand the language support to more minority languages.