Hey Common Voice Community,
Save the date!
We would like to invite you to our Ask Me Anything (AMA) session on the New Common Voice Language Variants with Francis, Linguistic Advisor for Common Voice. It will take place on the 24th January, 2-3pm UTC.
See timezones: January 24, 2022 to January 24, 2022
Background: Language, Variants and Accents
We want to make Common Voice more linguistically inclusive, so we are inviting communities to take part in determining variants for their languages. Learn more about the inclusion of variants in Common Voice on our blog.
To help language communities submit their suggestions, we have prepared community guidance. Please read the guidelines in full and then discuss them with your community groups.
Once your language community has discussed and decided which variants you'd like to support, please submit your choices via this Google Form before the 31st January, 23:00 UTC.
You can pre-submit questions from Friday 20th January, 10:30am UTC.
Any questions we are unable to answer live will be followed up at a later date. Please abide by the Community Participation Guidelines when proposing questions.
We look forward to answering your questions. Any questions not answered within the hour will be followed up.
Question 1: How will it be incorporated into the software? Will it divide current language datasets or will it be an option like accent, selected by the user? How will this affect already existing data?
Francis' response
It will be an option like accent, which is selected by the user on their profile page. It will be stored as another column in the database. Contributors will be able to change variants in their profile, much like they can change accent/s, but previous dataset releases will not be retroactively changed, so they will not have the variant annotation added.
Variants will be available only on people's profile pages for now, but later in the year we will likely expand to the speak interface.
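As a rough illustration for dataset consumers, here is a minimal sketch of reading clip metadata in a way that tolerates releases without the new column. The column names (`path`, `accents`, `variant`), the sample rows, and the `read_clips` helper are all assumptions made for this example and are not confirmed by the answer above.

```python
import csv
import io

def read_clips(tsv_text):
    """Parse tab-separated clip metadata, tolerating a missing 'variant' column."""
    rows = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Assumption: older releases simply lack the column, and an empty
        # cell means the contributor did not choose a variant.
        row["variant"] = row.get("variant") or None
        rows.append(row)
    return rows

# Hypothetical sample data: one release without the column, one with it.
old_release = "path\taccents\nclip1.mp3\tFrench accent\n"
new_release = "path\taccents\tvariant\nclip2.mp3\tFrench accent\tes-MX\n"

print(read_clips(old_release)[0]["variant"])
print(read_clips(new_release)[0]["variant"])
```

Treating a missing value as "unspecified" rather than guessing keeps older releases usable alongside newer ones.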
Question 2: In some variants the text-corpus is also affected. How will you handle these?
Francis' response
This is true. In many cases there will be written variants as well as spoken variants, and the two will be related. We understand that written variants are also important, and we intend to look at them at some point this year. There will be plenty of opportunity for community members to express their interest in this, so watch this space!
Question 3: In my understanding, we are trying to build models which can understand even foreign speakers (L1+L2+L3). How can we use variant info - except for testing perhaps?
Francis' response
If the question is how variants should be applied to L2+ speakers, this will be down to the individual user: they will have the option of specifying an accent and also a variant. For example, a French speaker who is speaking Mexican Spanish would use Mexican Spanish as the (hypothetical) variant, but they would be free to specify their accent as a French accent.
There are a few ways I could imagine the new metadata being used, and probably many more that I can't imagine. Aside from testing (as you note), it could be used for balancing training data, or doing multi-task learning. So far there has been little research into this, but the Common Voice dataset will enable that and hopefully improve speech recognition for everyone!
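To make the "balancing training data" idea concrete, here is a minimal sketch that downsamples every variant to the size of the smallest one. The field names (`variant`, `accent`), the variant codes, and the `balance_by_variant` helper are hypothetical, and downsampling is only one of several possible balancing strategies.

```python
import random
from collections import defaultdict

def balance_by_variant(clips, seed=0):
    """Downsample so that every variant contributes the same number of clips."""
    groups = defaultdict(list)
    for clip in clips:
        groups[clip["variant"]].append(clip)
    # The smallest group sets the number of clips kept per variant.
    target = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = []
    for variant in sorted(groups, key=str):
        balanced.extend(rng.sample(groups[variant], target))
    return balanced

# Hypothetical metadata: accent and variant are independent fields, so a
# French-accented speaker can still be tagged with a Mexican Spanish variant.
clips = (
    [{"path": f"mx{i}.mp3", "variant": "es-MX", "accent": "French"} for i in range(5)]
    + [{"path": f"ar{i}.mp3", "variant": "es-AR", "accent": ""} for i in range(2)]
)
balanced = balance_by_variant(clips)
print(len(balanced))
```

Downsampling discards data, so weighting the loss per variant may be preferable when clips are scarce.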
Question 4: With the mainly synthetic borders of nation-states of the pre-21st century, the concept is very political, and even that changes over time, as does the language/variant itself. Variants are mainly location-based and, with borders in hand, L2 languages got mixed with other L1/L2 languages of that area. (Ex: Turkish in the Balkans got mixed with other Balkan languages). There is no hard line between a new language, a variant, or an accent, so how can it be decided?
Francis' response
Disclaimer: I'm not part of the Mozilla team, so I can't say anything about legal issues about countries and borders etc.
You are absolutely right: geography and territory are political, and every system that tries to categorise language by geography is inherently political. Even beyond that, variants can capture other kinds of variation, such as cultural or historical.
We acknowledge this, but we also need to ensure the dataset is interoperable and scalable. Codes are a simple way to make sure MCV is easy to use. BCP-47 was chosen because it is arguably the most flexible and customisable convention around, and would give the community a lot of control. In the rare situation that BCP-47 couldn't accommodate a variant, we would work with the community to support them in finding another way to express it.
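For readers unfamiliar with BCP-47, here is a minimal sketch of how a tag breaks down into subtags. It is a deliberately simplified parser (it ignores extended language subtags, extensions, private-use subtags, and grandfathered tags), the `split_tag` helper is hypothetical, and the example tags are only illustrations, not the codes Common Voice will necessarily adopt.

```python
import re

def split_tag(tag):
    """Split a simple BCP-47 tag into language, script, region, and variants."""
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None, "region": None, "variants": []}
    for part in parts[1:]:
        nothing_after_script = result["region"] is None and not result["variants"]
        if re.fullmatch(r"[A-Za-z]{4}", part) and result["script"] is None and nothing_after_script:
            result["script"] = part.title()       # four letters, e.g. Latn
        elif re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", part) and nothing_after_script:
            result["region"] = part.upper()       # two letters or three digits
        elif re.fullmatch(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}", part):
            result["variants"].append(part.lower())  # e.g. valencia, 1901
        else:
            raise ValueError(f"unsupported subtag in this sketch: {part!r}")
    return result

print(split_tag("es-MX"))
print(split_tag("ca-valencia"))
print(split_tag("sr-Latn-RS"))
```

The point is that a single code can carry region, script, and variant information at once, which is what makes the convention flexible.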
Question 5: Hi, I'm a Valencian collaborator in the Catalan language team ... view full question: https://discourse.mozilla.org/t/ask-me-anything-ama-session-on-common-voice-variants-for-languages/91251/8?u=heyhillary
Francis' response
Bon dia! There are two parts to this question: the first is what the contributor should do, and the second is my opinion of what the contributor should do. So firstly, this should be defined in the validation criteria. We have guidelines about that, and I would encourage the Catalan community to localise the guidelines, coming up with the criteria you think best. As far as my opinion goes, I am very against the idea of having people speak sentences that they would not normally speak, or having to fake an accent in order to read the sentence. My advice to the contributor in this case would be either to skip the sentence or to speak it how they would normally speak it. The objective of Common Voice is to teach computers how people speak, not to make people speak how computers will understand them.
Question 6: This area seems to be best decided by linguists, not the general language community. Not every community has linguists in them. How will you decide on the final divisions? BCP-47 can be very fine-grained and you may get a huge list and/or it may be unbalanced between languages.
Francis' response
The purpose of adding variant codes is to allow people to identify with a particular variant of the language they speak. We heard from the language communities that they did not feel the choice between language and accent was a fair or sufficient one. We are also striving to improve the metadata for dataset consumers and, as accents were becoming a rich freeform field, there was a risk of losing broad categories of variation.
In terms of the final divisions, we will give communities the first chance to express their preferences. If we don't agree, it will be on us to explain why, for example by explaining possible adverse consequences. In this case we will raise it with the community and start a conversation about the next steps. You're right that BCP-47 can be very expressive, but we are dealing with a subset of the standard, and for most communities specifying a few (no more than ten) variants will be a good starting point. We expect that it will be unbalanced between languages: some languages will exhibit a lot of encoded variation, others little. What is important is that the ability to encode the variation is there for the language communities that want it. An important final point is that variants are optional; the functionality is there for the benefit of language communities. If a community decides they have no variants, or if they only want one variant, or if they need many variants, we are here to support that.