[Feature request] Dialects/language variants

psubhashish · March 22, 2023, 2:12pm

This is a feature request to let the language community, as a whole and not just as individuals, add dialects/variants at the sentence collection stage. I ask Mozilla staff developers and other community members not to dismiss this immediately by saying it has been discussed before, but to consider it in context, in this case that of my own language, Odia.

Languages are complex, and tech is not neutral. Many languages have many distinct forms. For instance, formal written Odia deviates a lot from the spoken variants, both in sentence structure and vocabulary. Speech data should capture these distinctions and nuances, since it is not written content. The sentence collection process assumes that sentences are in a highly codified, formalized, standard written form of the language. It also assumes that people merely pronounce words differently in the different places where the language is spoken, as with English. That IMO is an oversimplification leading to poor design. In reality, a sentence in a dialect can only be said naturally by a native speaker of that dialect. There is no point in other speakers wasting their time recording such sentences. But sentences in dialects should exist on CV to capture the diverse accents, intonations and vocabulary of natural speech. To simplify: all sentences in the standard written form of a language should be available to everyone for recording, but speech contributors should see sentences in a particular dialect only when they choose that dialect in their settings. Think of this as a filter that can also save volunteer time.
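To make the filter concrete, here is a minimal sketch in Python of the rule described above. Everything in it is hypothetical: the `Sentence` type, the `variant` field, the `visible_sentences` function and the placeholder sentences are made up for illustration and are not Common Voice's actual data model or API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sentence:
    """Hypothetical sentence record; not the real Common Voice schema."""
    text: str
    # None marks the standard written form; otherwise a dialect/variant tag.
    variant: Optional[str] = None


def visible_sentences(
    sentences: List[Sentence], selected_variant: Optional[str] = None
) -> List[Sentence]:
    """Return the sentences a speech contributor should be offered.

    Standard-form sentences are always offered. A dialect sentence is
    offered only when its tag matches the dialect the contributor selected.
    """
    return [
        s for s in sentences
        if s.variant is None or s.variant == selected_variant
    ]


# Placeholder pool (assumed data, for illustration only).
pool = [
    Sentence("standard sentence"),
    Sentence("Baleswari sentence", variant="baleswari"),
    Sentence("other variant sentence", variant="other-variant"),
]

# No dialect selected: only the standard sentence is offered.
default_view = visible_sentences(pool)

# Baleswari selected: the standard sentence plus the Baleswari one.
baleswari_view = visible_sentences(pool, "baleswari")
```

With a rule like this, nobody is ever asked to record a sentence in a dialect they did not opt into, and standard sentences stay available to everyone.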

Because uploaded sentences are shown to all users, who can skip them but not filter them, sentence contributors hesitate to contribute sentences in their dialects. Secondly, there is no indicator of whether a sentence is in a particular dialect or in the standard written form. The natural response to this flatness and unnatural neutrality is that a user assumes they must sound formal.

As CV matures with more speech data, adding a layer without too much visual/UI complication would matter a lot.

Allowing users to choose their dialect is also a good step. But dialects don't grow overnight. Each language has a limited set of dialects. Can that list not be displayed as a dropdown? Say I want to record 5 sentences in formal/standard Odia and 5 in the Baleswari dialect of Odia today. If those sentences are available, I should be able to switch quickly, because many of us code-switch based on context. At home, I speak the dialect; everywhere else, I speak the standard/formal variant. Tech doesn't need to flatten that entirely. Since I speak several languages with varying degrees of fluency, I'd really ask you to think about this plurality and not make assumptions based on dominant and/or Western languages. Also, a feature like this would not add visual/UI complexity; it would only do justice to the volunteer effort of users, since volunteering comes at a very high cost in many parts of the world.
