Ask Me Anything (AMA) session on Common Voice Variants for Languages

Hey Common Voice Community,

Save the date!

:partying_face: We would like to invite you to our Ask Me Anything (AMA) session on the new Common Voice language variants with Francis, Linguistic Advisor for Common Voice. Taking place on 24th January, 2-3pm UTC.

See timezones: January 24, 2022

Background: Language, Variants and Accents

We want to make Common Voice more linguistically inclusive, so we are inviting communities to take part in determining variants for their languages. Learn more about the inclusion of variants on Common Voice on our blog.

To support language communities in submitting their suggestions, please review the community guidance. Please read the guidelines in full and then discuss with your community groups.

Once your language community has discussed and decided on which variants you’d like to support, please submit your choices via this Google Form before 31st January, 23:00 UTC.

You can pre-submit questions from Friday 20th January, 10:30am UTC.

Any questions we are unable to answer live will be followed up on at a later date. Please abide by the Community Participation Guidelines when proposing questions.

We look forward to answering your questions :sparkles: Any questions not answered within the hour will be followed up.


Question 1: How will it be incorporated into the software? Will it divide current language datasets or will it be an option like accent, selected by the user? How will this affect already existing data?

Francis’ response

It will be an option like accent, selected by the user on their profile page: another column in the database. Previous releases will not have the variant annotation added. Contributors will be able to change variants in their profile, much as they can change accent(s), but previous dataset releases will not be retroactively changed.

Variants will be available on people’s profile pages only for now, but later in the year we will likely expand this to the speak interface.
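Since the answer describes the variant as just another column alongside the existing dataset fields, a minimal sketch of what that could look like for a dataset consumer follows. The column name `variant`, the tag values, and the sample rows are all hypothetical, not confirmed details of any Common Voice release:

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of a Common Voice release TSV with a "variant"
# column added next to the existing "accents" column. Field names and
# values here are illustrative only.
sample_tsv = """client_id\tpath\tsentence\taccents\tvariant
a1\tclip1.mp3\tBon dia a tothom\t\tca-valencia
b2\tclip2.mp3\tHola, com estas?\t\tca-ES
c3\tclip3.mp3\tQue tinguis bon dia\tFrench accent\tca-valencia
"""

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

# Because the variant is just another column, existing pipelines keep
# working unchanged; consumers can opt in by grouping or filtering on it.
by_variant = Counter(row["variant"] for row in rows)
print(by_variant)  # Counter({'ca-valencia': 2, 'ca-ES': 1})
```

This also illustrates why older releases are unaffected: tools that never read the extra column behave exactly as before.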


Question 2: In some variants the text-corpus is also affected. How will you handle these?

Francis’ response

This is true; in many cases, as well as spoken variants there will be written variants, and the two will be related. We understand that written variants are also important, and we intend to look at that at some point this year. There will be plenty of opportunity for community members to express their interest in this, so watch this space!


Question 3: In my understanding, we are trying to build models which can understand even foreign speakers (L1+L2+L3). How can we use variant info - except for testing perhaps?

Francis’ response

If the question is how variants should be applied to L2+ speakers, this will be down to the individual user: they will have the option of specifying an accent and also a variant. For example, a French speaker who is speaking Mexican Spanish would use Mexican Spanish as the (hypothetical) variant, but they would be free to specify their accent as a French accent.

There are a few ways I could imagine the new metadata would be used, and probably many more that I can’t imagine. Aside from testing (as you note), it could be used for balancing training data, or doing multi-task learning. So far there has been little research into this, but the Common Voice dataset will enable that and hopefully improve speech recognition for everyone!
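One of the uses mentioned above, balancing training data across variants, can be sketched very simply. Everything here is hypothetical: the record shape, the `variant` field name, and the downsampling-to-smallest-group strategy are one possible approach, not a Common Voice recipe:

```python
import random
from collections import defaultdict

# Hypothetical clip records tagged with a variant code.
clips = [
    {"path": "a.mp3", "variant": "es-MX"},
    {"path": "b.mp3", "variant": "es-MX"},
    {"path": "c.mp3", "variant": "es-MX"},
    {"path": "d.mp3", "variant": "es-ES"},
    {"path": "e.mp3", "variant": "es-ES"},
]

def balance_by_variant(records, seed=0):
    """Downsample every variant group to the size of the smallest one,
    so no single variant dominates the training mix."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["variant"]].append(rec)
    smallest = min(len(g) for g in groups.values())
    rng = random.Random(seed)
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced

balanced = balance_by_variant(clips)
print(len(balanced))  # 4: two clips per variant
```

For testing, the same grouping lets you report word error rate per variant rather than only a single aggregate figure.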


Question 4: With the mainly synthetic borders of pre-21st-century nation-states, the concept is very political, and as those change over time, so does the language/variant itself. Variants are mainly location-based, and with borders in hand, L2 languages got mixed with other L1/L2 languages of the area (e.g. Turkish in the Balkans got mixed with other Balkan languages). There is no hard line between a new language, a variant, or an accent; how can it be decided?

Francis’ response

Disclaimer: I’m not part of the Mozilla team, so can’t say anything about legal issues about countries and borders etc.

You are absolutely right: geography and territory are political, and every system that tries to categorise language by geography is inherently political. Even beyond that, variants can capture other kinds of variation, such as cultural or historical.

We acknowledge this, but also need to ensure the dataset is interoperable and scalable - codes are a simple way to make sure MCV is easy to use. BCP-47 was chosen because it is arguably the most flexible and customisable convention around, and would give the community a lot of control. In the rare situation that BCP-47 couldn’t accommodate a variant, we would work with the community to support them to find another way to express it.
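To make the BCP-47 structure concrete, here is a simplified sketch of how a tag decomposes into subtags. The tags `ca-valencia` (Valencian Catalan) and `de-CH-1901` (Swiss German, traditional orthography) are registered in the IANA Language Subtag Registry; the parser below is a toy illustration, not a full RFC 5646 implementation (it ignores script and extension subtags, for instance):

```python
def split_tag(tag: str) -> dict:
    """Naively split a BCP-47-style tag into language, region and
    variant subtags. A real parser must also handle script subtags,
    extensions and grandfathered tags."""
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "region": None, "variants": []}
    for sub in parts[1:]:
        if len(sub) == 2 and sub.isalpha():
            # two-letter alphabetic subtag -> region (e.g. CH, BR)
            result["region"] = sub.upper()
        else:
            # variant subtags are 5-8 chars, or 4 chars starting
            # with a digit (e.g. "valencia", "1901")
            result["variants"].append(sub.lower())
    return result

print(split_tag("ca-valencia"))
# {'language': 'ca', 'region': None, 'variants': ['valencia']}
print(split_tag("de-CH-1901"))
# {'language': 'de', 'region': 'CH', 'variants': ['1901']}
```

The key point is that a variant subtag does not have to be geographic: `1901` encodes an orthographic era, not a place, which is the kind of flexibility the answer above refers to.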


Question 5: Hi, I’m a valencian collaborator in the catalan language team ... view full question: https://discourse.mozilla.org/t/ask-me-anything-ama-session-on-common-voice-variants-for-languages/91251/8?u=heyhillary

Francis’ response

Bon dia! :slight_smile: There are two parts to this question: the first is what the contributor should do, and the second is my opinion of what the contributor should do. So firstly, this should be defined in the validation criteria. We have guidelines about that, and I would encourage the Catalan community to localise the guidelines, coming up with the criteria you think best. As far as my opinion goes, I am very much against the idea of having people speak out sentences that they would not normally speak, or having to fake an accent in order to read the sentence. My advice in this case to the contributor would be to either skip the sentence, or to speak it out how they would normally speak it. The objective of Common Voice is to teach computers how people speak, not to make people speak how computers will understand them.


Question 6: This area seems to be best decided by linguists, not the general language community. Not every community has linguists in them. How will you decide on the final divisions? BCP-47 can be very fine-grained and you may get a huge list and/or it may be unbalanced between languages.

Francis’ response

The purpose of adding variant codes is to allow people to identify with a particular variant of the language they speak. We heard from language communities that they did not feel the choice between language and accent was a fair or sufficient one. We are also striving to improve the metadata for dataset consumers; as accents were becoming a rich freeform field, there was a risk of losing broad categories of variation.

In terms of the final divisions, we will give communities the first chance to express their preferences. If we don’t agree, it will be on us to explain why, for example by pointing out possible adverse consequences. In that case we will raise it with the community and start a conversation about the next steps.

You’re right that BCP-47 can be very expressive, but we are dealing with a subset of the standard, and for most communities specifying a few (no more than ten) variants will be a good starting point. We expect that it will be unbalanced between languages: some languages will exhibit a lot of encoded variation, others little. What is important is that the ability to encode the variation is there for the language communities that want it.

An important final point is that variants are optional; the functionality is there for the benefit of language communities. If a community decides they have no variants, or they want only one variant, or they need many variants, we are here to support that.

2 Likes

OK, let me start @ftyers . I’m no expert (on anything), I read about the topic over the last few days and I’m a bit confused about the approach. So these questions will be quite basic…

  1. This area seems to be best decided by linguists, not the general language community. Not every community has linguists in them. How will you decide on the final divisions? BCP-47 can be very fine-grained and you may get a huge list and/or it may be unbalanced between languages.
  2. How will it be incorporated into the software? Will it divide current language datasets or will it be an option like accent, selected by the user? How will this affect already existing data?
  3. In some variants the text-corpus is also affected. How will you handle these?
  4. Most of the datasets are low-resourced and have a low number of speakers. How would splitting them into smaller sets help? Will they be statistically significant? And if you leave it to the user will it be accurate?
  5. In my understanding, we are trying to build models which can understand even foreign speakers (L1+L2+L3). How can we use variant info - except for testing perhaps?
  6. With the mainly synthetic borders of pre-21st-century nation-states, the concept is very political, and as those change over time, so does the language/variant itself. Variants are mainly location-based, and with borders in hand, L2 languages got mixed with other L1/L2 languages of the area (e.g. Turkish in the Balkans got mixed with other Balkan languages). There is no hard line between a new language, a variant, or an accent; how can it be decided?

These are basic but not easy to answer globally I think…

Thank you (and waiting to work on these with you for Turkish)…

3 Likes

Hi everyone! Hola a todos! Olá a todos! Salut à tous! Привет всем! Hei alle! :wave:

I’m Francis Tyers, a linguistic advisor working with the Common Voice team. My expertise is in language technology development for Indigenous and marginalised languages. I’ve been working with the Common Voice project for a number of years now and take a special interest in helping new communities get started. I’ve worked on the language variant strategy and look forward to hearing and answering your questions! :slight_smile:

PS. Thanks @bozden for the questions you have asked already!

3 Likes

Hello, my name is Marcelo, I’m the manager for pt-BR at Pontoon.
This variants initiative is welcome to us, since the Portuguese language in Brazil has many written and spoken differences from Portugal and other countries.
I believe a single pt-BR variant will be sufficient for us.
I would like to know whether the variants will be applied only in voice datasets, or also in the localization of the Common Voice site.

3 Likes

Hi, I’m a Valencian collaborator in the Catalan language team. I have some ideas about how to handle the variants issue in my language, keeping in mind that the Catalan language has a polycentric normative; that is, we have morphological variants mainly affecting the verbal system. We can distinguish three variants for many simple present forms, for example. This poses some problems for willing speakers in the project. Some of them (mostly Valencian ones) tend to change the text and speak the “Valencian” form, but this should be considered bad practice. I’d like info for collaborators to stress the fact that speakers should read what they have before their eyes. Providing other verbal forms is a matter of providing more variety of texts (a task which, btw, I’m involved in as a Valencian). From my point of view, speakers should always read with their natural accent whatever text they are shown, and texts should offer suitable variants. I don’t find it acceptable to let the speaker interpret and decide about the text regarding this topic. What is your opinion about this?

3 Likes

Hi everyone, thanks ever so much for your questions! :slight_smile: I’m still working on responding to all of them. If you have any further questions, you can ask here or on Matrix. My door is always open :slight_smile:

1 Like