Dialects or Language Variants

Karim_Piar · December 28, 2024, 2:53pm

I am contributing Burushaski language data to the Common Voice project, and I want to ensure fair representation of all its dialects. As a speaker of the Hunza Burushaski dialect, most of the data I have submitted so far reflects only my variety of the language. However, Burushaski has four other dialects that are currently not represented in the project.

I am using a writing system that has been agreed upon by speakers of all dialects, which makes it possible to include data from other varieties as well. To achieve this, I aim to encourage poets and writers from other dialect groups to participate and contribute their voices and texts.
Could you provide me with guidance on how the Common Voice project handles such situations to ensure inclusivity? Also, how can I effectively engage speakers of other Burushaski dialects and convince them to contribute?

bozden · December 29, 2024, 1:59pm

Hey @Karim_Piar, Common Voice supports Variants, which cover both sentence-level variation (e.g. different scripts, or the influence of other languages in a given geographic region) and speech style (dialect). There is also “Accent” support, which can be both pre-defined and free-form.

Those need to be worked out and added to the system through a PR on GitHub, though. That includes the BCP-47 language coding, which you will probably need to check. If you have linguists in the community and/or can reach one at a university, that would be best. The distribution of speakers among countries (India?) can be especially important.
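As a rough illustration of the BCP-47 check mentioned above, here is a minimal Python sketch that tests whether a candidate tag is syntactically well-formed under RFC 5646. It assumes `bsk` (the ISO 639-3 code for Burushaski) as the language subtag; the `hunza` subtag and the private-use form `bsk-x-hunza` are hypothetical placeholders, not registered subtags and not necessarily what Common Voice would accept. The pattern covers only language, script, region, variant and private-use subtags, and it does not consult the IANA subtag registry.

```python
import re

# Subtag shapes from RFC 5646 (BCP 47). Syntax only: nothing here checks
# whether a subtag is actually registered with IANA.
LANGUAGE = r"[a-z]{2,3}"                          # e.g. "bsk" (Burushaski)
SCRIPT = r"[a-z]{4}"                              # e.g. "Arab", "Latn"
REGION = r"(?:[a-z]{2}|[0-9]{3})"                 # e.g. "PK"
VARIANT = r"(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3})"   # 5-8 chars, or 4 starting with a digit
PRIVATE_USE = r"x(?:-[a-z0-9]{1,8})+"             # e.g. "x-hunza" (hypothetical)

TAG = re.compile(
    rf"{LANGUAGE}(?:-{SCRIPT})?(?:-{REGION})?(?:-{VARIANT})*(?:-{PRIVATE_USE})?",
    re.IGNORECASE,
)


def is_well_formed(tag: str) -> bool:
    """Return True if the tag matches the simplified BCP-47 syntax above."""
    return TAG.fullmatch(tag) is not None


if __name__ == "__main__":
    # Candidate tags are illustrative assumptions, not agreed Common Voice codes.
    for candidate in ["bsk", "bsk-PK", "bsk-x-hunza", "bsk_hunza"]:
        print(candidate, is_well_formed(candidate))
```

A tag that passes this check is only well-formed; whether it is the right code for a given dialect is still a question for the community, linguists, and the reviewers of the PR.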

Here is what I did for Circassian languages - a rather complex case because of diaspora and transliteration. Check this GitHub PR. And this is for accents.

When these are defined, they will be visible in forms. E.g.

The profile form will show variants and a preset list of accents (if also defined; the field will still be free-form for cases like foreign-speaker accents).
Sentence addition workflows will include variant selection (to support script changes like Arabic/Latin/Cyrillic, changes in written form or meaning due to dialect, and/or loanwords that arise when the language is not a national language, for example).

Hanzo_Zone · January 3, 2025, 5:16am

The Common Voice project is designed to collect diverse speech data in multiple languages and dialects to create accurate and inclusive speech recognition models. To ensure fair representation of all Burushaski dialects, here’s how the project generally handles such situations and some strategies you can use to engage other dialect communities:

Common Voice’s Approach to Inclusivity

Dialect Representation: The Common Voice project encourages the collection of data from a wide variety of speakers, including those from different dialects of a language. When submitting data for a language like Burushaski, it’s essential to specify which dialect the data represents, so that the project can ensure diverse contributions. You can also set up separate categories for different dialects, if applicable, to allow for better representation.
Validation and Categorization: For dialectal diversity, data may be manually reviewed or categorized based on speaker regions or dialects. It’s important that contributors from different dialects are able to self-identify and label their submissions appropriately.
Encouraging Participation: Common Voice thrives on community contributions. The more diverse the contributors, the better the representation of the language in the dataset. This can be achieved by ensuring that speakers from all dialects feel welcomed to participate.

Engaging Other Burushaski Dialects

To encourage speakers of other Burushaski dialects to contribute, consider the following steps:

Create Awareness: Use social media, local community groups, and platforms where Burushaski speakers from different regions are active to raise awareness about the importance of contributing to the project. Highlight how it benefits their dialect and helps preserve their language through modern technology.
Collaborate with Local Poets and Writers: Since you mentioned poets and writers, reaching out to them is an excellent way to tap into community leaders who can influence others. Poets and writers often have strong ties to their communities and may be more motivated to participate, knowing they are contributing to the preservation of their dialect in a widely used technological project.
Work with Local Organizations: Partner with cultural organizations, language advocacy groups, or educational institutions that are invested in the preservation of Burushaski dialects. They can help with organizing events or campaigns that promote involvement in the Common Voice project.
Highlight the Benefits: Emphasize the long-term benefits of contributing to the project, such as the ability to improve speech recognition technology for Burushaski speakers, helping preserve the language, and enhancing accessibility for people in the Burushaski-speaking community.
Provide Resources: Ensure that those interested in contributing feel supported. Provide clear instructions, examples of what kind of content is needed, and perhaps a sample script to get them started. Offering technical assistance or troubleshooting can also make the process smoother.

By actively involving speakers from all Burushaski dialects, you can help ensure that the Common Voice dataset truly represents the linguistic diversity of the Burushaski language, making it more useful for speech recognition tools and contributing to language preservation.