Continuing the discussion from “Feedback needed: Languages and accents strategy”: I would like to share with you the latest version of the strategy, which has been worked out over the past months together with expert linguists and a very thorough legal review.
This has been a tremendous amount of work from a lot of people, and it will help inform how we collect data, how we distribute it, and how we use it to improve our STT model training. Thank you everyone!
Our most immediate next step is to plan how to adapt our platform to capture the data we need based on the recommendations from this strategy.
Common Voice goal: Generate data to train speech recognition engines, with a focus on the main markets where Mozilla has a presence with its products.
Need: Quality dataset in languages spoken there (text and voice).
How: Machines need to be taught how the written form of a sentence can be predicted from its spoken form.
- In each language, people say words differently based on many factors (for example accent, dialect, sociolect, idiolect). However, all of these categories are fluid, and creating a finite set of “labels” to describe a person’s speech is impossible.
- The phonetics experts we consulted agree, based on experience and research, that self-identification is not reliable. Speech variation is usually better coded by geographic location and by determining phonetic groups/zones/areas that match application needs.
Main questions for this strategy
- How do we optimize our dataset to capture the wide range of variation in the same language?
- How do we collect data in order to be useful for optimizing STT models for various product needs?
This document focuses on solving data collection and identification. We want trained STT models to be able to optimize for different pronunciations of the same language so that, as a result, we can market our products in those markets.
Context and background
- Allow training of STT models with a specific population in mind (e.g. Scottish English speakers).
- Act as a value offering for anyone looking for a high quality dataset to use with their technology.
- Support very concrete product needs for serving specific use cases (e.g. Spanish speakers with a southern Spain accent).
We realize that the way that Mozilla has historically identified languages/locales with variants might not always be useful for Common Voice and Deep Speech goals.
For this project’s needs, we consider a dataset-language to be a common writing system that shares the same words, grammar and script, acknowledging that informal expressions can differ between territories.
Example: Spanish in Spain and Spanish in Mexico should both be considered “Spanish” at a dataset-language level.
Reference: Unicode languages and scripts.
There are two important considerations we need to make upfront when talking about accents.
Accents are a combination of phonetics, intonation, lexicon and other factors. This is a fluid concept, not something we can capture in a fixed list.
Phonetic differences (the different sounds people make when pronouncing words in their language) are the key factor we need to understand to optimize our STT models in markets where we will have a product focus.
Through our conversation with specialized linguists, we have come to a few realizations:
- Accent unawareness: People are not good at self-identifying how they sound, due to bias or lack of knowledge, so self-identification is not reliable enough for our work.
- Phonetic cues: The main factors that determine how you sound are the place where you lived most of your early life, how long you lived there, and other places where you have lived for an extended period of time.
- There is no perfect formula to determine a person’s phonetic area, but this approach is very accurate at determining which people should be placed in a phonetic area and which ones we shouldn’t take into consideration.
- When required by our go-to-market needs, we should work with linguists in each target language to determine phonetic areas/zones/groups that match our needs.
- Having too many phonetic areas can lead to not having enough data to train models. We should start with just a few large areas per language and then divide them if we need to identify certain phonetic differences in our dataset.
Having a great diversity of phonetics, captured in a way that can be used for both product and linguistic needs, provides a huge and unique value to the Common Voice dataset. Commercial offerings on the market can’t actually accomplish this.
Common Voice will only accept as dataset-languages the ones that follow the language definition previously explained. Phonetic differences within a language won’t be considered different languages; that information will be captured as additional data for the given language (for example, Spanish from Spain and Spanish from Mexico should both be considered Spanish, with the differences captured through demographics).
We will allow people to identify which languages they consider themselves native speakers of.
Each language should have just one separate dataset, and it shouldn’t contain words or symbols that are not part of the language or that belong to a different script.
Understanding phonetic differences
The following phases describe independent efforts that we will need at different moments, from collecting the data, to organizing this data and using it for our product needs. They are independent from each other, although some of them depend on having data collection done.
We need to make sure we capture information from the speaker that can be used later to suggest how to use this data for STT training. Based on our research, we know we need to capture information on geographical location:
- The place where the person grew up and how long they lived there.
- Other places where they have lived and for how long.
We won’t be asking people about their accent anymore. We will use the territory information to codify countries and ask for the approximate location of the places where they have lived and for how long. We’ll capture the approximate latitude and longitude (data that is consistent across time). Implementation should analyze how to request this in a way that respects our privacy principles (for example, by clicking a place on a map).
This information on latitude and longitude will be distributed along with the Common Voice dataset for anyone to use.
We might need to migrate existing accent data and translate it into one specific location for people to review, or ask everyone to fill in their demographics again.
- Location data will always be optional.
- We’ll ask about places where they have lived for more than 5 years (we can offer ordering), but we won’t capture more detailed data that could allow building a profile timeline.
- When asking for location, the information stored will be a generic lat/long (called “Region location”) from the region they select, not the exact lat/long captured (through map clicking or other). So if someone selects any location within Texas, we’ll capture a generic Texas lat/long for everyone.
- We won’t publish information for regions unless they have at least 100 speakers.
- We won’t capture gender in countries where doing so could result in legal risk.
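To make the rules above more concrete, here is a minimal sketch of how the stored record and the publication threshold could work. Everything in it is an assumption for illustration: the region codes, the coordinates, and the names `REGION_LOCATIONS`, `store_residence` and `publishable_regions` are hypothetical, not part of the platform. It also reads “more than 5 years” as strictly greater than 5.

```python
from dataclasses import dataclass

# Hypothetical lookup table: one generic coordinate per selectable region.
# Codes and coordinates are illustrative placeholders, not project data.
REGION_LOCATIONS = {
    "US-TX": (31.0, -100.0),
    "ES-AN": (37.5, -4.7),
}

MIN_YEARS_LIVED = 5            # only stays of more than 5 years are stored
MIN_SPEAKERS_PER_REGION = 100  # regions below this are not published


@dataclass(frozen=True)
class StoredResidence:
    speaker_id: str
    region: str
    lat: float
    lon: float
    years: int


def store_residence(speaker_id, region, years):
    """Build the record to store, or None if the stay is too short.

    Only the generic region coordinate is kept. The exact point a person
    clicked on a map is never passed in, so it cannot be stored.
    """
    if years <= MIN_YEARS_LIVED:
        return None
    lat, lon = REGION_LOCATIONS[region]
    return StoredResidence(speaker_id, region, lat, lon, years)


def publishable_regions(records):
    """Return the regions that have enough distinct speakers to publish."""
    speakers = {}
    for record in records:
        speakers.setdefault(record.region, set()).add(record.speaker_id)
    return {
        region
        for region, ids in speakers.items()
        if len(ids) >= MIN_SPEAKERS_PER_REGION
    }
```

With this shape, two people who click different towns in Texas end up with identical coordinates, and a region with 99 speakers is held back until one more person contributes.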
After we have collected geographic information from users, we will work with linguists to determine when the data we have from a person is more or less reliable to use. We need to remember there is no perfect formula to determine someone’s phonetic classification, so we’ll use a proxy to cover our needs.
In general, if a person was born in a place and lived their first 20 years there, that’s a strong factor for placing this person there, but we should also check whether they have lived for a long time in other places.
This formula will need work with linguists, and we acknowledge there is no perfect solution, just a proxy for our needs. We will probably have more clarity once we have started to capture location and can see some trends and stats about them.
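As a starting point for that conversation, one possible shape for such a proxy is sketched below. It is not the formula: the weighting scheme, the `EARLY_LIFE_WEIGHT` value and the function name `place_speaker` are all assumptions made up for illustration, and the real weights would have to come from the linguists. It only uses the two pieces of data described above, the place where someone grew up and the other places they lived, each with a duration.

```python
# Assumption: years in the place where someone grew up count double.
# This value is a placeholder, not a linguistically validated weight.
EARLY_LIFE_WEIGHT = 2.0


def place_speaker(grew_up, other_places):
    """Pick the most likely region for a speaker and how dominant it is.

    grew_up:      (region, years) for the place the person grew up.
    other_places: list of (region, years) for other long stays.
    Returns (region, share), where share is the winning region's fraction
    of the total weighted years, or (None, 0.0) if no years were given.
    """
    scores = {}
    region, years = grew_up
    scores[region] = EARLY_LIFE_WEIGHT * years
    for region, years in other_places:
        scores[region] = scores.get(region, 0.0) + years

    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / total
```

A low share (for example, someone who spent similar time in two regions) could be the signal that a person’s data is less reliable to use for a specific phonetic area.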
This will determine how we store location data in our dataset.
Go to market
Once we have all the data we need in our dataset, and depending on Mozilla’s go-to-market needs, we should work with linguists specialized in phonetics in those languages to understand the different phonetic areas/zones we can draw based on our needs. Not all languages will need this phase, and it will only be used internally for product/market needs.
Example: “What are the different phonetic areas in the UK to identify in our dataset English speakers with Scottish phonetics?”
When defining phonetic areas we want to follow these principles:
- P1: Useful for STT training with product market needs in mind.
- P1: Requiring limited intervention (crowdsourced).
- P2: Scalable to all languages where we can train an STT model.
We shouldn’t be worried about improving or increasing these areas in the future, because having the location data will allow us to place speakers without asking for more data again in the future.
Each phonetic area will list the locations that belong to it, so we can determine a speaker’s phonetic area when needed.
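A minimal sketch of that lookup, assuming areas are defined as sets of region codes. The area names, the region codes used as keys and the helper names are hypothetical, chosen only to mirror the Scottish English example.

```python
# Hypothetical area definitions: each phonetic area lists its regions.
PHONETIC_AREAS = {
    "scotland": {"GB-SCT"},
    "rest_of_uk": {"GB-ENG", "GB-WLS", "GB-NIR"},
}


def build_region_index(areas):
    """Invert area -> regions into region -> area for fast lookups."""
    index = {}
    for area, regions in areas.items():
        for region in regions:
            if region in index:
                # A region in two areas would make placement ambiguous.
                raise ValueError(f"{region} is listed in more than one area")
            index[region] = area
    return index


def phonetic_area(region, index):
    """Return the phonetic area for a stored region, or None if unmapped."""
    return index.get(region)
```

Because speakers are stored by location rather than by area, splitting `rest_of_uk` into smaller areas later only means editing `PHONETIC_AREAS` and rebuilding the index; nobody has to be asked for more data.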
Ideally, we will use the phonetic area information to train models focused on specific markets. Our dataset can then be used to optimize for these audiences, and if a model doesn’t perform well enough, we can always create more areas for better identification.
Depending on how much data we have, this will either be possible or it will signal a need to collect more speakers from specific locations.
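That check could be as simple as counting speakers per phonetic area, as in the sketch below. The function name and the `min_speakers` parameter are assumptions; the strategy does not define how many speakers an area needs for training.

```python
from collections import Counter


def speaker_shortfall(areas, speaker_areas, min_speakers):
    """Report how many more speakers each under-represented area needs.

    areas:         iterable of all phonetic area names we care about.
    speaker_areas: dict mapping speaker id -> phonetic area.
    min_speakers:  hypothetical minimum number of speakers per area.
    Returns a dict of area -> number of additional speakers needed.
    """
    counts = Counter(speaker_areas.values())
    return {
        area: min_speakers - counts[area]
        for area in areas
        if counts[area] < min_speakers
    }
```

An empty result would mean every area has enough speakers; anything else points to where collection efforts should focus.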
References
- Tagliamonte, S. A. (2012). Variationist Sociolinguistics: Change, Observation, Interpretation. Sussex: Wiley-Blackwell.
- Luk, J. C. (1998). Hong Kong students’ awareness of and reactions to accent differences.
- Kopylovskaya, M. Y., & Dobrova, T. Y. (2017). Dealing with “West-East” cultural divide: The problem of cultural unawareness in ESP for international relations. Journal of Teaching English for Specific and Academic Purposes, 4(3), 505-516.
- Baker, W., Eddington, D., & Nay, L. (2009). Dialect identification: The effects of region of origin and amount of experience. American Speech, 84(1), 48-71.
- Findlay, A. M., Hoy, C., & Stockdale, A. (2004). In what sense English? An exploration of English migrant identities and identification. Journal of Ethnic and Migration Studies, 30(1), 59-79.
- Williams, A., Garrett, P., & Coupland, N. (1999). Dialect recognition. In D. R. Preston (Ed.), Handbook of Perceptual Dialectology.