Common Voice languages and accent strategy v5

Continuing the discussion from :speaking_head: Feedback needed: Languages and accents strategy, I would like to share with you the latest version of the strategy, worked out over the past months together with expert linguists and after a very thorough legal review.

This has been a tremendous amount of work from a lot of people, and it will help inform how we collect data, how we distribute it, and how we use it to improve our STT model training. Thank you everyone!

Our most immediate next step is to plan how to adapt our platform to capture the data we need based on the recommendations from this strategy.

Executive summary

Common Voice goal: Generate data to train speech recognition engines, with a focus on the main markets where Mozilla's products have a presence.

Need: A quality dataset (text and voice) in the languages spoken in those markets.

How: Machines need to be taught how the written form of a sentence can be predicted from its spoken form.

Known limitations

  • In each language, people say words differently based on many factors (for example accent, dialect, sociolect, idiolect). However, all of these categories are fluid, and creating a finite set of “labels” to describe a person’s speech is impossible.
  • The phonetics experts we consulted agree (based on experience and research) that self-identification is not reliable; this is usually better coded based on geographic location, determining phonetic groups/zones/areas that match application needs.

Main questions for this strategy

  • How do we optimize our dataset to capture the wide range of variation in the same language?
  • How do we collect data so that it is useful for optimizing STT models for various product needs?

This document is focused on data collection and identification. We want trained STT models to be able to optimize for different pronunciations within the same language and, as a result, let us market our products there.

Context and background

Our goals

  • Allow training of STT models with a specific population in mind (e.g. Scottish English speakers).
  • Act as a value offering for anyone looking for a high quality dataset to use with their technology.
  • Support very concrete product needs for serving specific use-cases (e.g. Spanish speakers with a southern Spain accent).


We realize that the way that Mozilla has historically identified languages/locales with variants might not always be useful for Common Voice and Deep Speech goals.

For this project's needs, we consider a dataset-language to be a common writing system that shares the same words, grammar and script, acknowledging that informal expressions can differ between territories.

Example: Spanish in Spain and Spanish in Mexico should be both considered “Spanish” at a dataset-language level.

Reference: Unicode languages and scripts.


There are two important considerations we need to make upfront when talking about accents.

  1. Accents are a combination of phonetics, intonation, lexicon and other factors; this is a fluid concept, not something we can capture in a fixed list.

  2. Phonetic differences (the different sounds people make when pronouncing words in their language) are the key factor we need to understand to optimize our STT models in the markets where we have product focus.


Through our conversation with specialized linguists, we have come to a few realizations:

  • Accent unawareness: People are not good at self-identifying how they sound; due to bias or lack of knowledge, self-identification is not really reliable for our work.
  • Phonetic cues: The main factors that determine how you sound are the place where you lived most of your early life, how long you lived there, and the other places where you have lived for an extended period of time.
  • There is no perfect formula to determine a person's phonetic area, but this proxy is quite accurate for deciding which people belong in a given phonetic area and which ones we should not take into consideration.
  • When required by our go-to-market needs, we should work with linguists in each target language to determine phonetic areas/zones/groups that match our needs.
  • Having too many phonetic areas can leave us without enough data to train models; we should start with just a few large areas per language and subdivide them later if we need to identify certain phonetic differences in our dataset.

Having great phonetic diversity, in a way that can serve both product and linguistic needs, provides a huge and unique value to the Common Voice dataset, something that commercial offerings on the market do not actually accomplish.



Common Voice will only accept as dataset-languages those that follow the language definition explained above. Phonetic differences within a language won't be considered different languages; that information will be captured as additional data for the given language (for example, Spanish from Spain and Spanish from Mexico are both considered Spanish, and the differences are captured through demographics).

We will allow people to identify which languages they consider themselves native in.

Each language should have a single, separate dataset, and it shouldn't contain words or symbols that are not part of the language, or a different script.

Understanding phonetic differences

The following phases describe separate efforts that we will need at different moments, from collecting the data, to organizing it, to using it for our product needs. The phases are largely independent of each other, although some of them depend on data collection having been completed.

Data collection

We need to make sure we capture information from the speaker that can later be used to make suggestions on how to use this data for STT training. Based on our research, we know we need to capture information on geographical location:

  • The place where the person grew up and how long they lived there.
  • Other places they have been living and for how long.

We won’t be asking people about their accent anymore. Instead, we will use territory information: we’ll codify countries and ask for the approximate location of the places they have lived and for how long, capturing an approximate latitude and longitude (data that is consistent over time). Implementation should analyze how to request this in a way that respects our privacy principles (for example, clicking on a place on a map).

This information on latitude and longitude will be distributed along with the Common Voice dataset for anyone to use.

We might need to migrate the existing accent data, translating it into one specific location for people to review, or ask everyone to fill in their demographics again.

Legal considerations
  • Location data will always be optional.
  • We’ll ask about places where they have lived for more than 5 years (we can offer ordering), but we won’t capture more detailed data that could allow building a profile timeline.
  • When asking for location, the information stored will be a generic lat/long (called a “Region location”) for the region they select, not the exact lat/long captured (through map clicking or otherwise). So if someone selects any location within Texas, we’ll store the same generic Texas lat/long for everyone.
  • We won’t publish information for regions unless they have at least 100 speakers.
  • We won’t capture gender in countries where doing so could create legal risk.
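As an illustration of the “Region location” rule above, here is a minimal Python sketch. The two-region centroid table is hypothetical, and nearest-centroid distance stands in for a real point-in-polygon boundary lookup; the point is that only the generic region coordinates are ever stored:

```python
import math

# Hypothetical region centroid table; a real one would cover every subdivision.
REGION_CENTROIDS = {
    "US-TX": (31.0, -100.0),
    "US-OK": (35.6, -97.5),
}

def generic_location(clicked_lat, clicked_lon):
    """Map an exact clicked point to its region's generic lat/long.

    Nearest-centroid distance is a stand-in for a proper boundary lookup;
    the exact clicked coordinates are discarded, never stored.
    """
    def dist(centroid):
        lat, lon = centroid
        return math.hypot(clicked_lat - lat, clicked_lon - lon)

    region = min(REGION_CENTROIDS, key=lambda r: dist(REGION_CENTROIDS[r]))
    return region, REGION_CENTROIDS[region]
```

So a click anywhere in Texas resolves to the same generic Texas point, which is what would be published with the dataset.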

Data bucketing

After we have collected geographic information from users, we will work with linguists to determine when the data we have from a person is more or less reliable to use. We need to remember there is no perfect formula to determine someone’s phonetic classification, we’ll use a proxy to cover our needs.

In general, if a person was born and spent their first 20 years in one place, that's a strong factor for placing them there, but we should also look at whether they have lived for a long time in other places.

This formula will need to be worked out with linguists, and we acknowledge there is no perfect solution, just a proxy for our needs. We will probably have more clarity once we have started to capture locations and can see some trends and stats about them.
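As a sketch of such a proxy (the weights, field names and the "formative years count double" rule are invented for illustration, not an agreed formula), the bucketing could look like:

```python
# Invented proxy: years lived in a place count double when they were
# formative years (roughly the first 20 years of life).
def primary_phonetic_location(residences, formative_weight=2.0):
    """residences: list of {"region": str, "years": float, "formative": bool}."""
    scores = {}
    for r in residences:
        weight = formative_weight if r["formative"] else 1.0
        scores[r["region"]] = scores.get(r["region"], 0.0) + weight * r["years"]
    region = max(scores, key=scores.get)
    confidence = scores[region] / sum(scores.values())  # dominance of top region
    return region, confidence
```

A speaker who grew up in Glasgow and later spent five years in London would be placed in Glasgow with high confidence; someone with an even split across places would come out with a low confidence score, flagging their data as less reliable for area-specific training.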

This will determine how we store location data in our dataset.

Go to market

Once we have all the data we need in our dataset, and depending on Mozilla's go-to-market needs, we should work with linguists specialized in phonetics in those languages to understand the different phonetic areas/zones we can draw based on our needs. Not all languages will need this phase, and it will only be used internally for product/market needs.

Example: “What are the different phonetic areas in the UK to identify in our dataset English speakers with Scottish phonetics?”

When defining phonetic areas we want to follow these principles:

  • P1: Useful for STT training with product market needs in mind.
  • P1: Requiring limited intervention (crowdsourced).
  • P2: Scalable to all languages where we can train an STT model.

We shouldn’t worry about improving or adding to these areas in the future, because having the location data will allow us to place speakers without asking them for more data again.

Each phonetic area will list the locations that are part of it, so we can determine a speaker's phonetic area when needed.
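A minimal sketch of that lookup (with invented area names and region points, purely for illustration) could map each generic region point to its phonetic area:

```python
# Invented area names and region points, purely for illustration.
PHONETIC_AREAS = {
    "en-scottish": {(55.95, -3.19), (55.86, -4.25)},         # assumed region points
    "en-southern-england": {(51.51, -0.13), (51.45, -2.59)},
}

def area_for_speaker(region_point):
    """Return the phonetic area containing this generic region point, if any."""
    for area, points in PHONETIC_AREAS.items():
        if region_point in points:
            return area
    return None  # unassigned: the area definitions don't cover this point yet
```

Because areas are just sets of region points, redrawing or splitting an area later only means editing the table; the speakers' stored locations never need to change.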

Ideally we will use the phonetic area information to train models focused on specific markets. Our dataset can then be used to optimize for these audiences, and if a model doesn't perform well enough, we can always create more areas for finer identification.

Depending on how much data we have, this will either be possible or it will signal a need to collect more speakers from specific locations.


Accent unawareness

Phonetic cues

  • Baker, W., Eddington, D., & Nay, L. (2009). Dialect identification: The effects of region of origin and amount of experience. American Speech, 84(1), 48-71.
  • Findlay, A. M., Hoy, C., & Stockdale, A. (2004). In what sense English? An exploration of English migrant identities and identification. Journal of Ethnic and Migration Studies, 30(1), 59-79.
  • Williams, A., Garrett, P., & Coupland, N. (1999). Dialect recognition. In Preston, D. R. (Ed.), Handbook of Perceptual Dialectology.

Nice. We had put “your birthplace” as an accent in the accents dropdown for Traditional Chinese in Taiwan (which uses city-level options) and Simplified Chinese in China (which uses province-level options), following several linguists’ suggestions when the “accents” dropdown was set up. It turned out to be the right direction, and it should be rather easy to convert to the new format we will use.

One thing to consider is that the appropriate location accuracy differs per language. In the above example, Taiwan is a small island, so we consider “city of birth” a good granularity for accent data; for China, which is much bigger, we use provinces as the options, which is accurate enough for such a big country (otherwise we would have thousands of locations for people to choose from).

Perhaps it won’t be a problem if the plan is to ask people to click on a map. I assume we will need rather detailed boundary GeoJSON data for all the cities of the world?

The idea is to convert the point you indicate (we don’t know yet if via a map or something else) into a generic location for the region you are in. Region here means a subdivision at the country level (it might be called a region, province or state in different countries), so these are broad.

Note that this information might end being used in different ways depending on the needs.

It might be that some regions from different nearby countries are grouped together to train a model for language X, or that data from one single region contains enough information to train a model that optimizes for the phonetic differences there.

The good thing is that we’ll have the data captured already and we can decide how to use it later.

It may be hard to select a “standard resolution” for the converted location data across countries. Do we want to convert it to city level (roughly 100 km accuracy?), district level (10 km resolution?) or state/province level (500 km accuracy?)

The chosen resolution could end up too detailed to be de-identified (10 km in the mid US, for example) or too coarse for research and development (500 km for Taiwan?).

Another simple approach is to provide a “birthplace” dropdown and ask contributors in each language to provide the data they think is suitable for voice applications, so we can offer either a list of cities or a list of states depending on each region’s situation, and convert to lat/long after the user selects from the dropdown.

If we store the data as latitude/longitude, we are completely independent of whatever countries and regions decide about changing their divisions. Maybe there is a way to define generic lat/lon points every N kilometres, and then assign people to the closest one.

This way we have a map of points (stored as numerical values), and when using the data we can group them however we need at different moments.
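As a sketch of that grid idea (the ~50 km cell size is an assumption, and the degrees-per-kilometre conversion is approximate), we could snap every reported coordinate to the nearest generic grid point:

```python
import math

def snap_to_grid(lat, lon, cell_km=50.0):
    """Snap a coordinate to the nearest point of a ~cell_km grid."""
    # ~111 km per degree of latitude; longitude degrees shrink with cos(latitude)
    cell_deg_lat = cell_km / 111.0
    snapped_lat = round(lat / cell_deg_lat) * cell_deg_lat
    cell_deg_lon = cell_km / (111.0 * max(math.cos(math.radians(snapped_lat)), 0.01))
    snapped_lon = round(lon / cell_deg_lon) * cell_deg_lon
    return (round(snapped_lat, 4), round(snapped_lon, 4))
```

Nearby contributors collapse onto the same generic point, distant ones onto different points, and the stored values stay plain numbers that can be regrouped any way we need later.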

What will happen to the old data that is already collected?

We’ll need to find a way to convert it into the new format (actual numeric locations); we’ll probably ask for community help to define a generic location for each of the current “accents”.

If you’re assigning arbitrary geographic coordinates, it’s best to set them to places where you’re certain no-one lives, such as in the middle of a river. Otherwise this kind of thing can happen:

Interesting, although I don’t know if we’ll have the capability to tell which lat/lon points fall in places where people live. It seems like a non-trivial task.

Maybe we can use Plus Codes instead of lat/lon. They divide the earth into a grid of rectangular cells, with each additional character pair subdividing a cell into a 20×20 grid; a code indicates a rectangular region rather than a point, it’s easy to convert and to adjust the precision, and a truncated code can’t be converted back to a specific lat/lon point.


  • is a 14 m × 14 m rectangle area (of the river near my house),
  • 7QQ32G is a ~5 km × 5 km area covering a quarter of metro Taipei,
  • 7QQ3 is the code for a ~100 km × 100 km rectangle area (covering most of northern Taiwan and a big area of nearby ocean), and
  • 7Q is the code for a ~2000 km × 2000 km area covering the whole of Taiwan, half of the Philippines and Okinawa, Japan.

We could ask people to point at the place on a map, and store only the first 6 characters (a ~5 km × 5 km rectangle) or the first 4 (~100 km × 100 km accuracy, a little too coarse for me).
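To make the truncation idea concrete, here is a minimal Python sketch of the pair-encoding step of Open Location Codes. It only produces the leading even-length prefix (real Plus Codes pad short codes with '0' and append a '+'), and the Taipei coordinates in the example are approximate:

```python
# Minimal sketch of Open Location Code (Plus Code) prefix encoding.
# Each character pair narrows the cell by a factor of 20 per axis,
# so truncating the code is exactly "adjusting the precision".
ALPHABET = "23456789CFGHJMPQRVWX"  # the 20-symbol OLC alphabet

def encode(lat, lon, code_length=6):
    """Return the leading code_length characters (an even number) of a Plus Code."""
    lat = min(max(lat + 90.0, 0.0), 180.0 - 1e-12)  # shift latitude to 0..180
    lon = (lon + 180.0) % 360.0                      # shift longitude to 0..360
    code = []
    resolution = 20.0  # degrees covered by the first character pair
    for _ in range(code_length // 2):
        lat_digit = int(lat // resolution)
        lon_digit = int(lon // resolution)
        code.append(ALPHABET[lat_digit])
        code.append(ALPHABET[lon_digit])
        lat -= lat_digit * resolution
        lon -= lon_digit * resolution
        resolution /= 20.0
    return "".join(code)
```

Storing only the 6- or 4-character prefix then gives exactly the coarse rectangles described above, with no way back to the clicked point.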


Yes, that’s a good idea, because it’s obvious from looking at it that a Plus Code is an area and not an exact pinpointed location. The problem MaxMind had was that they were using a precise marker to represent a broad area.