🗣 Feedback needed: Languages and accents strategy

Hello everyone,

I would like to open this topic to collect feedback from all our communities and partners. This topic contains a proposal crafted by the Common Voice and Deep Speech staff teams based on conversations with different volunteers and linguistics experts.

We will keep this topic open for feedback until May 26th. After that the same team will gather the input and create a final version.

We will seek consensus and agreement from most people involved, but the ultimate decision maker will be George Roter (@george).

What we want to know from you:

  • Does the proposal resonate with you or your language?
  • Do you have any flags?
  • If so, what’s the issue and why is it important?

Thanks for your comments!

Context and background


We realize the way Mozilla has historically identified languages/locales with variants might not always be useful for Common Voice and Deep Speech goals.

We consider a language to be a common writing system with shared words and grammar, acknowledging that informal expressions can vary from place to place.

Each language should have just one dataset, and it shouldn’t contain words that are not part of the language (different symbols or scripts).


We consider an accent to be the combination of intonation (sound) and phonetics (spoken letters).

Accents usually come from places; for example, different cities where the same language is spoken can have different accents.

For Deep Speech, the more specific the information, the better. Knowing the location of an accent is very useful. The following scale orders the information we are looking for from least to most useful:

:frowning_face: No data < I don’t know < Country < Region < City :slight_smile:

We should seek the most specific location for an accent, ideally the nearest city.

When a language is not your native one, your accent can come from your native language or, if you are close to native, from a location.
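
To make the ordering above concrete, here is a minimal sketch of how the scale could be represented so that answers can be compared. The names `AccentGranularity` and `most_specific` are hypothetical and only for illustration; nothing here reflects how Common Voice actually stores this.

```python
from enum import IntEnum
from typing import Iterable


class AccentGranularity(IntEnum):
    """Hypothetical encoding of the scale: a higher value is more useful."""
    NO_DATA = 0
    UNKNOWN = 1   # the "I don't know" answer
    COUNTRY = 2
    REGION = 3
    CITY = 4


def most_specific(levels: Iterable[AccentGranularity]) -> AccentGranularity:
    """Return the most specific level seen, or NO_DATA when there is none."""
    return max(levels, default=AccentGranularity.NO_DATA)
```

Using an ordered enum means "prefer the nearest city" becomes a plain comparison, and an empty answer falls back to the least useful level.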

We should allow people to self-identify where their accents are coming from.

Languages strategy proposal

Common Voice will only accept as languages those that fit the language definition explained above. We won’t allow language variants to be treated as different languages; that information will be captured as accents.

We will allow people to identify which languages they consider themselves native in.

Accent strategy proposal

We will ask people for their accents in each language.

For languages they identified as native we will ask “Where does your accent come from? (select the nearest place)”, options:

  • I don’t know
  • Country list
  • Regions list
  • Cities list

We want to encourage people to select the most concrete location, preferably their nearest city (we would like to provide autocompletion from a list of known world cities so this is not free text).

For languages not identified as native we will ask “Where does your accent come from? (select the nearest place)”, options:

  • I don’t know
  • My native language
  • Country list
  • Regions list
  • Cities list
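
As a rough illustration of the two option lists above, the only difference between the native and non-native question is one extra choice. This is a sketch only; `accent_options` and `BASE_OPTIONS` are made-up names, not part of the Common Voice code.

```python
from typing import List

# Options offered for every language, in the order proposed above.
BASE_OPTIONS: List[str] = [
    "I don't know",
    "Country list",
    "Regions list",
    "Cities list",
]


def accent_options(is_native: bool) -> List[str]:
    """Return the accent options for one language (hypothetical helper)."""
    if is_native:
        return list(BASE_OPTIONS)
    # Non-native speakers can also attribute their accent to their native language.
    return [BASE_OPTIONS[0], "My native language", *BASE_OPTIONS[1:]]
```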

Great that you are advancing this topic!

I would think, however, that whether I am native or non-native, what my native language is, and whether it really had much of an influence might not be that easy to answer.

Think about people who were born and lived in Arabic-speaking countries, so their native tongue could be considered Arabic, but who studied in British or American boarding schools, learnt the language from native speakers, and spoke it exclusively throughout most of their youth. American or British English is not their native language, but most people who listen to them might still think it is. So which box do they tick? Native American English? Non-native Arabic? A mixture?

It also seems not uncommon for people to learn one non-native language first and carry its pronunciation into a second non-native language. Consider people who speak Persian with a thick Gulf Arabic accent but are actually German and live in Germany. And remember that Gulf Arabic sounds nothing like e.g. Maghreb Arabic, so it’s not even as simple as ticking “Arabic”. So Persian is not their native tongue, and what influenced them was not their native tongue either, but some third language. Which boxes do they tick?

And finally, what about speakers who are native according to their passport but who lived abroad with their families? Their passport says they are British or American, but they unmistakably carry some aspects of a Spanish accent. Which boxes do they tick?

I would suggest stepping away from a native / non-native binary and presenting everyone with the same simple question and options: What influenced your accent?

  • I don’t know
  • list of:
    • language + proficiency + location

What about different words in different accents? To give an example, Dutch (Netherlands) and Flemish (Belgium) share 98% of their vocabulary. 1% simply has a different meaning (shouldn’t be a problem), but the other 1% are words that are local and sometimes unknown to the listener.

Some of these words are already in the data now, and when I hear someone with the other accent pronounce them they sound really wrong, which might reduce accuracy. The same actually holds for things like (place) names.

Therefore it would be good if some words/sentences could be flagged under one of the accents. In that way, the other groups could skip it if they don’t recognise it. That is actually how the dictionaries also work.
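
To show what I mean, here is a minimal sketch of sentences carrying optional accent flags so that other groups can skip them. The `Sentence` type, the `sentences_for` helper and the labels `nl-NL` / `nl-BE` are all hypothetical; I don't know how the sentence collection is actually stored.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Sentence:
    text: str
    # Accents the sentence is flagged for; empty means suitable for everyone.
    accents: FrozenSet[str] = frozenset()


def sentences_for(speaker_accent: str, sentences: Iterable[Sentence]) -> List[Sentence]:
    """Keep unflagged sentences plus those flagged for the speaker's accent."""
    return [s for s in sentences if not s.accents or speaker_accent in s.accents]
```

For example, a sentence flagged with `frozenset({"nl-BE"})` would be offered to Flemish speakers but skipped for speakers who chose `nl-NL`, while unflagged sentences are offered to both.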

We want people to self-assess and decide based on what they think. If they are unsure or don’t want to share, they can always leave it blank.

How many of these words are used in formal language? Can we consider them slang?

Ideally the sentences dataset should not include a lot of slang words, and encountering these words once in a hundred times is not a big issue or a terrible experience. We don’t have a perfect model and you will always encounter sentences with minor issues.

If this is a major problem we can always gather a list with all these words and filter them out if needed.

I’m assuming if I selected a city it would set the country and region fields automatically so that it was still searchable with broader criteria?

That’s correct, this is one of the lists we have been looking into.
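
A minimal sketch of that behaviour, assuming a lookup table keyed by city name: the `CITY_INDEX` table, `Place`, `expand_city` and `matches` are hypothetical, and a real list would need unique IDs because city names repeat across countries.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Place(NamedTuple):
    city: Optional[str]
    region: Optional[str]
    country: Optional[str]


# Hypothetical two-entry table: city -> (region, country).
CITY_INDEX: Dict[str, Tuple[str, str]] = {
    "Utrecht": ("Utrecht", "Netherlands"),
    "Antwerp": ("Flanders", "Belgium"),
}


def expand_city(city: str) -> Place:
    """Fill in region and country automatically from the selected city."""
    region, country = CITY_INDEX[city]
    return Place(city=city, region=region, country=country)


def matches(place: Place, *, country: Optional[str] = None,
            region: Optional[str] = None) -> bool:
    """Broader search: a filter that is left as None is ignored."""
    return ((country is None or place.country == country)
            and (region is None or place.region == region))
```

With this shape, a clip tagged with a city would still be found when searching only by country or region.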

(Trying to interpret what Jef meant:) An example is the word “aanrijden”. In most parts of the Netherlands, this would mean to hit somebody with e.g. your car, but in the South, it means that you just got into your car. It’s not slang, but a perfectly regular word that simply has an additional meaning in the South.

I’ve seen similar Flemish constructs when contributing Dutch, and I’m not always sure whether my intonation is the same as a Flemish person’s would be, because I never use such constructs.

I think it’s generally a good idea to collect as much data as possible (and as much as people agree to). Based on that raw data it might be possible to optimize the model in the future, and raw, accurate data is always a good base for any work of this kind.

7 posts were split to a new topic: Privacy concerns about dataset metadata

I largely agree with using birthplace data for accent data, which is what we currently use in the accent menu for the zh-cn and zh-tw locales. Many language researchers have told me that it is easier for people to choose and easier for researchers to process.

The first problem with the dataset is that it’s in English. People are not used to selecting the name of their city in English, and the autocomplete would offer no suggestion for the local name, which is a problem especially when we encourage people who are not used to technology to contribute their voice.

The second problem is accuracy. I checked the list: it is only about 20% accurate for Taiwan compared to the city list we currently use*1, and it has only 800 cities for China, where there should be more than 2.5K*2. We would need to fork it and maintain it.

*1 https://github.com/mozilla/voice-web/pull/1876/files#diff-a5837602d8f8b2f77869552f3896bbd2
*2 http://big5.www.gov.cn/gate/big5/www.gov.cn/test/2011-08/22/content_1930111.htm

Third, when we discussed which accent list is better for zh-cn, the local community had a long discussion and decided to use the province as the birthplace rather than the city. There are too many cities, and the accent differences across cities are not obvious. Of course, the more detail the better, but we should still keep the list to a reasonable size so that it is easier to maintain and does not need frequent changes (as a city-level list would). For China, I think the second administrative level is more suitable than cities; that would give us about 350 prefectures.

We also want to use an official list with the numeric IDs from the China PID system*3 (in which the first 2 digits represent the birthplace) to make choosing easier. Because the system is shared, everyone can easily find the option they should select; even older people or people with little education will know how to identify themselves. The voice data is also easier to process with the official ID.

*3 https://github.com/mozilla/voice-web/pull/1929/files#diff-a5837602d8f8b2f77869552f3896bbd2
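
To illustrate the idea, here is a small sketch of mapping the two-digit prefix to a province. The function name and the table are hypothetical, only three province codes are shown, and the form would ask for the two-digit code only, never a full ID number.

```python
from typing import Dict, Optional

# Illustrative subset of two-digit province codes; a real list would be complete.
PROVINCE_BY_PREFIX: Dict[str, str] = {
    "11": "Beijing",
    "31": "Shanghai",
    "44": "Guangdong",
}


def province_from_prefix(prefix: str) -> Optional[str]:
    """Return the province for a two-digit code, or None if it is not recognised."""
    code = prefix.strip()[:2]
    if len(code) < 2 or not code.isdigit():
        return None
    return PROVINCE_BY_PREFIX.get(code)
```

Because the code is numeric and shared, the same key can be used both in the selection menu and when processing the voice data later.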

Considering these problems, I think it would be better to ask contributors to propose the best places list for their locale. We only have 28 languages, and I don’t think it would be difficult for each locale to come up with the most suitable list.

Please note that the proposal is not about “birth places” but about “where is your accent coming from”, allowing people to self-evaluate this.

You may have been born in a specific place but now live in another, and you may or may not recognize that your accent now comes from the new place. That’s why we should allow people to decide for themselves.

Sure, I’m giving feedback on how to come up with the places list; it works whether the place is self-identified or not.

Perhaps we should say “select the most appropriate place; if you don’t know how to, then select your birthplace” to make it easier for some people.

Please also consider accents that come from a cultural background rather than from a location. For example, immigrants who speak with a different tongue, e.g. Turkish-German, Russian-German, …
The same goes for non-native speakers, e.g. German-English, French-English, … which might result in a very different pronunciation … (I already pointed this out in Region or dialect)

Please also ask local translators to find the right / common accents for a language, otherwise users might not be willing to select the right accent if they are unfamiliar with the (too generic) labels or can’t find the right entry.

If I understood your request correctly, the current proposal also captures this, since the accent for a non-native language can be “My native language”. For example, if you speak French natively and English as a second language, you can select “My native language” as your English accent if you consider that you have a French accent when speaking English.


Hi Rubén, thanks for soliciting this feedback! Would you be open to including an accent category (or categories) for people with impaired enunciation from dysarthria (see e.g. https://www.asha.org/Practice-Portal/Clinical-Topics/Dysarthria-in-Adults/)?

I’m developing a speech-to-text model for someone with spastic dysarthria, and we’re planning to retrain a DeepSpeech model on our audio/transcription data. So we need to record and transcribe a bunch of audio clips anyway for our training data. When I came across the Common Voice project (via DeepSpeech) it occurred to me that if we could collect/store our transcriptions on Common Voice, then this voice data would be available for everyone, and maybe someone else could come up with an even better model than we do.

I think this population would be a great match for Common Voice: people who can be understood by other people who know them, but who are not understood by Google/Amazon voice APIs or by humans who are not familiar with their voice.

Hi, I don’t really know how this would work. My understanding is that in order to train a model we need at least 1000 different voices, and we would need to figure out with a linguist whether this can be considered an accent or whether each person with impaired enunciation has a unique one, which would prevent us from creating a general model for it.

Hi Rubén, my idea was just to create an accent category or categories, so that the audio/transcription data would be publicly available through the Common Voice dataset. I wouldn’t rely on Mozilla to make a model; I’m training our own voice model anyway (which I would also be happy to share if desired).

On your 1000 different voices concern, I have two responses. First, I think we could definitely find this many people or more with these voice disorders who are interested in contributing their voice to Common Voice, because this is a population for whom existing voice transcription models do not work adequately. Second, suppose even a general ‘spastic dysarthria’ model does not work well for most specific individuals, so that a useful model would generally need additional training data from each individual user (I don’t know yet whether or not this is the case). Even in this case, the person-specific model would still clearly be useful far sooner, and need far less person-specific training data, if retrained from a pretrained ‘spastic dysarthria’ model rather than from a model previously trained on those with typical enunciation. So any application built on top of a voice-transcription model aiming to be useful to individuals with voice disorders would be useful far more quickly to each user, with less training time required.

I appreciate your consideration, and if you still feel like this is just outside the scope of Common Voice, I understand. But I do think including a dysarthria accent category (ideally with subcategories by type listed in my previous link) would help facilitate the creation of really useful voice transcription models by the community.

There is also a correlation between accent and both level of education and bilingualism (influence).
We could also ask people for their level of education and the other languages they speak, for better segmentation.