🗣 Feedback needed: Languages and accents strategy

Please also consider accents that come from a cultural background rather than from a different location, for example immigrants who speak with a different tongue, e.g. Turkish-German, Russian-German, …
The same goes for non-native speakers, e.g. German-English, French-English, …, which might result in a very different pronunciation … (I already pointed this out under Region or dialect)

Please also ask local translators to find the right / common accents for a language; otherwise users might not be willing to select the right accent if they are unfamiliar with the (too generic) labels or can’t find the right entry.

If I understood your request correctly, the current proposal also captures this, since the accent for a non-native language can be “My native language”. For example, if French is your native language and English is your second language, you can select your English accent as “coming from my native language” if you consider that you have a French accent when speaking English.

Hi Rubén, thanks for soliciting this feedback! Would you be open to including an accent category (or categories) for people with impaired enunciation from dysarthria (see e.g. https://www.asha.org/Practice-Portal/Clinical-Topics/Dysarthria-in-Adults/)?

I’m developing a speech-to-text model for someone with spastic dysarthria, and we’re planning to retrain a DeepSpeech model on our audio/transcription data. So we need to record and transcribe a bunch of audio clips anyway for our training data. When I came across the Common Voice project (via DeepSpeech), it occurred to me that if we could collect/store our audio and transcriptions on Common Voice, then this voice data would be available to everyone, and maybe someone else could come up with an even better model than we do.

I think this population would be a great match for Common Voice: people who can be understood by other people who know them, but who are not understood by Google/Amazon voice APIs or by humans who are not familiar with their voice.

Hi, I don’t really know how this would work. My understanding is that in order to train a model we need at least 1000 different voices, and we should figure out with a linguist whether this can be considered an accent or whether each person with impaired enunciation would have a unique one, which would keep us from creating a general model for it.

Hi Rubén, my idea was just to create an accent category or categories, so that the audio/transcription data would be publicly available through the Common Voice dataset. I wouldn’t rely on Mozilla to make a model; I’m training our own voice model anyway (which I would also be happy to share if desired).

On your 1000 different voices concern, I have two responses. First, I think we could definitely find this many people or more with these voice disorders who are interested in contributing their voice to Common Voice, because this is a population for whom existing voice transcription models do not work adequately. Second, suppose even a general ‘spastic dysarthria’ model does not work well for most specific individuals, so that a useful model would generally need additional training data from each individual user (I don’t know yet whether or not this is the case). Even then, the person-specific model would still clearly be useful far sooner, and need far less person-specific training data, if it were retrained from a pretrained ‘spastic dysarthria’ model rather than from a model previously trained on people with typical enunciation. So any application built on top of a voice-transcription model and aiming to be useful to individuals with voice disorders would become useful to each user far more quickly, with less training time required.

I appreciate your consideration, and if you still feel that this is just outside the scope of Common Voice, I understand. But I do think including a dysarthria accent category (ideally with subcategories by type, as listed in my previous link) would help facilitate the creation of really useful voice transcription models by the community.

There is also a correlation between accent and the level of studies and bilingualism (influence).
We can also ask people to give their level of study and other spoken languages for better segmentation.

Hi, do you have a link to read more about this? It would be interesting to consider if it’s backed by linguistic experts.

Hi, nice topic.
I agree with the first question about “Native or not”, but I would prefer it in second position. We can begin with something like “Do you think that you have a particular regional or foreign accent?”. If the answer is no, nothing more is needed; otherwise we ask whether the language is native or not.
Next, I think there is absolutely no need for three levels of detail; we would complicate the process for insignificant differences (if any between cities).
I propose that non-native contributors declare their native language and country, and that native contributors declare their country and region. The region should be as large as possible; for example, in France we can propose one of the 13 metropolitan regions instead of the departments (there are 97).
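
A minimal sketch of the question order suggested above, just to make the branching concrete. The `AccentProfile` type, the `build_profile` function, and all field names are hypothetical and not part of any Common Voice code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccentProfile:
    """Hypothetical record of what a contributor declared."""
    has_accent: bool
    is_native: Optional[bool] = None
    native_language: Optional[str] = None
    country: Optional[str] = None
    region: Optional[str] = None


def build_profile(has_accent, is_native=None, native_language=None,
                  country=None, region=None):
    # Question 1: no perceived accent means nothing more is asked.
    if not has_accent:
        return AccentProfile(has_accent=False)
    # Question 2: native or not decides which details are kept.
    if is_native is None:
        raise ValueError("is_native is required when an accent is declared")
    if is_native:
        # Native speakers: country and (large) region only.
        return AccentProfile(True, True, country=country, region=region)
    # Non-native speakers: native language and country only.
    return AccentProfile(True, False, native_language=native_language,
                         country=country)
```

With this shape, a native speaker's answer never carries a native-language field, and a non-native speaker's answer never carries a region.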

@achraf.khelil thanks for your feedback and welcome to the Common Voice community!

It seems to me that your message is centered on implementation and user experience. I would say that’s something our UX experts will figure out how to make easy when we get there. In my opinion we shouldn’t worry much about it now, and should focus on agreeing on the basic needs and strategy.

Cheers.

Thank you for your response, but I’m not focused on UX. I suggested these changes to:

  • Respond to the confusing situations raised by @dschridde: non-native speakers without an accent and native speakers with an accent.
  • Reduce the data set and improve segmentation performance by bringing it closer to reality. The reality is that accents do not differ between cities, but between regions. And for non-native speakers, their regions and cities are not important; it’s their native language and country which can influence their accent.

I would say that’s a very broad statement. I have some examples in my country where accents change between cities within the same region, so this can be different from one country to another. Also, this is what we heard from our linguistic experts.

The good thing about this proposal is that people can decide whether to self-identify with a country, region, city or none. This should not matter, because we have data about where each city is located (region and country), so it won’t affect people who just select regions in countries where there is no difference between cities.
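
To illustrate why the chosen granularity should not matter, here is a rough sketch of the roll-up, assuming a lookup table from city to region and country. `CITY_INDEX` and `resolve_location` are hypothetical names, and the two entries are only sample data.

```python
# Hypothetical lookup: city -> (region, country). Sample entries only.
CITY_INDEX = {
    "Lyon": ("Auvergne-Rhône-Alpes", "France"),
    "Sevilla": ("Andalucía", "Spain"),
}


def resolve_location(country=None, region=None, city=None):
    """Fill in region and country when the contributor only picked a city."""
    if city is not None and city in CITY_INDEX:
        region, country = CITY_INDEX[city]
    # Anything the contributor left out, and that cannot be derived, stays None.
    return {"country": country, "region": region, "city": city}
```

A contributor who selects only a city can still be grouped by region, and one who selects only a region is stored as-is.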

In my humble opinion and experience, there are lots of nuances to “accent” in a voice and the words used in communication. It will be difficult to capture all of these nuances with just one location.

However, given that we are dealing with this data primarily for research and development – including machine learning – I think there is value in categorizing contributions by the apparent characteristics of the contribution.

I believe this is best supported by “tags” which augment the primary category. For example, a speaker of English could have their category (“English”) described by multiple tags. #us_southern might appropriately apply to a large group of speakers in the United States, but even people in the same city might sound dramatically different. In Kansas City I think I would apply #us_midwestern to most speakers, but additionally some might be described with #urban and/or #latino even though they grew up within a few miles of each other. There are likely subtleties that would apply to someone #latino from #newyork_bronx, or someone #jewish from #newyork_bronx.

This sort of tagging is a little more complex, but I think the accent differences between geographically near locations in New York are much more pronounced than those between Indiana and southern Kansas, which are hundreds of miles apart and would both fall under a general #us_midwestern tag.

In less widely spoken languages I can imagine location defines much of this. But I also think a tagging system would serve just as well.

The challenge with this system is keeping the tags under control (avoiding both #southern and #us_southern, for example). But if you allow a dynamic and easily searchable list that could be curated by admins to merge similar tags – including automatic updates to users who had the curated tags – plus if you can keep the data associated with the individual so later refinements of their tags would retroactively be applied to earlier contributions then I think you would have a usable mechanism.
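
A minimal sketch of how such a curated tag list could behave, assuming tags are stored per speaker and resolved at read time so that a later merge applies retroactively. The `TagRegistry` class and its methods are hypothetical, and the normalization shown only handles ASCII tags.

```python
import re


class TagRegistry:
    """Hypothetical store of accent tags with admin-driven merges."""

    def __init__(self):
        self._aliases = {}       # merged tag -> tag it was merged into
        self._speaker_tags = {}  # speaker id -> set of normalized tags

    @staticmethod
    def normalize(tag):
        # "#US Southern" -> "us_southern" (ASCII-only simplification)
        tag = tag.strip().lstrip("#").lower()
        return re.sub(r"[^a-z0-9]+", "_", tag).strip("_")

    def canonical(self, tag):
        # Follow the chain of merges until a tag with no alias is reached.
        tag = self.normalize(tag)
        seen = set()
        while tag in self._aliases and tag not in seen:
            seen.add(tag)
            tag = self._aliases[tag]
        return tag

    def merge(self, duplicate, canonical):
        # Admin action: fold `duplicate` into `canonical`.
        dup = self.normalize(duplicate)
        target = self.canonical(canonical)
        if dup != target:
            self._aliases[dup] = target

    def tag_speaker(self, speaker_id, *tags):
        stored = self._speaker_tags.setdefault(speaker_id, set())
        stored.update(self.normalize(t) for t in tags)

    def tags_for(self, speaker_id):
        # Resolved on read, so earlier contributions pick up later merges.
        return {self.canonical(t) for t in self._speaker_tags.get(speaker_id, set())}
```

Because clips would reference the speaker rather than a copy of the tags, merging #southern into #us_southern changes what every earlier contribution from that speaker reports.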

I’m also worried about the complexity of this proposal: we don’t have the bandwidth to maintain a curated list of tags, or the time to have admins patrolling for duplicates in all languages.

Being more granular than city is probably better, yes, but at some point we need to make a compromise between utility, complexity and resources.

Hi! I’m working on a fully accent-independent way to help everyone with pronunciation remediation for free. Instead of examining whether an accented utterance is or is not “correct” according to a pronunciation expert or panel of judges, we use Nakagawa (2011) and his grad students’ method of trying to predict whether a listener, whether they be a native English speaker or not, would transcribe the utterance as the speech which was supposed to have been said. Please join the Spoken Language Interest Group of the IEEE Learning Technologies Standards Committee at http://bit.ly/slig and follow the main Discourse topic at: Intelligibility remediation

Thank you!

One small point: in your definition of a language, in order to avoid many variants of the same language, you make reference to a common writing system. It means that for a language without a stable writing system (and they are numerous) your definition will not be operative; you will not be able to say whether it’s a variant of the language with a variant of the writing system, or a new language. If you want to be normative, you may add “if a writing system exists which allows you to express the same words with the same grammar…”

Thanks for the feedback @gadda and welcome to the community!

Can you provide a few languages as examples where this happens?

Thanks!

I see feedback is closing soon. I have time for a few thoughts about writing systems. (There’s much to say about “the same words and grammar”, too.)

Why include a writing system in the definition of a language? In line with @gadda, many languages do not have one writing system. Reasons for this can include:

  1. the language is not (or at least not primarily) written
  2. the language has no standard writing system
  3. the language has historically been written with multiple systems
  4. the language has multiple standard writing systems
  5. the language has multiple conventions (with or without a standard)

Examples of 4: Mongolian, Serbian. And of the last: ASL.

Moreover, scripts have variants. Do “Simplified” and “Traditional” Chinese characters represent different languages? What about various scripts for writing Assyrian Aramaic? Or, looking back in history, what about Egyptian, or the Tamil Brahmic system variants layered onto the language? I don’t know which cases matter to these projects, but they’re relevant linguistically.

I’m a bit late to this thread, hopefully you’re still reading feedback.

My concern for you is about languages.

You have a definition for what you consider to be a language, but in the end, the definition alone may not help you. When people start requesting languages that are not so common, it will become difficult to know whether they’re requesting a valid language.

(1) Sometimes it’s very hard to decide whether something is just a “variant” rather than a whole language of its own. Unless you have a team of linguists working on categorizing languages, you probably won’t be able to decide properly.

(2) You may also face difficulties when accepting or rejecting constructed languages. You already have Esperanto, that one is easy to accept. But if someone would request Toki Pona, would you accept it or reject it?

Based on those two problems, in Tatoeba we ended up following the ISO 639-3 categorization. This helps us to decide what is a language, what is a dialect/variant, and which constructed languages are “officially” recognized as languages.

I can give you a concrete example of a difficult decision with Arabic.

  • Based on the ISO 639-3 categorization, our contributors are allowed to request each of the languages listed under the Arabic macrolanguage.
  • We had a request to add “Gulf Arabic” as a language: https://github.com/Tatoeba/tatoeba2/issues/1084
  • Some people disagreed with this and argued that there is only one Arabic language and that adding Gulf Arabic will make the Tatoeba corpus messy.
  • Someone counter-argued that there is linguistic evidence for separating Arabic into those many languages.
  • Gulf Arabic is valid based on the ISO 639-3 language list, and we added it.

More recently we had issues with Berber and Kabyle. I won’t go into details for this one, this is just to say that Arabic was not the only difficult case.

Based on my experience, I would recommend that you check what’s available out there and get a predefined list of languages, so that you can tell your contributors “These are the languages we acknowledge as languages” and don’t have to worry too much about drawing the line between languages, dialects and variants. It doesn’t have to be ISO 639; there may be something else more suitable for you.

Even with a predefined list, it won’t spare you from lots of headaches, but at least it will give you a direction for deciding what languages you can accept.

Also, if it helps, here’s a snippet of Tatoeba’s instructions for language requests so that you have a concrete idea how we handle this:

Search for your language in the ISO 639-3 list of languages.

[…]

Please understand that if your language is not recognized in the ISO 639-3 standard, we cannot support it in Tatoeba. Language classification is a complex task and it is not part of Tatoeba’s mission. We rely on the ISO 639-3 standard to define what is a valid language.

There are some exceptions due to legacy reasons, but we will not make more exceptions.

If your language is missing in this standard, please contact the ISO 639-3 Registration Authority from their website: https://iso639-3.sil.org/.
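
As a rough illustration of the "predefined list" approach described above, a request check could look like the sketch below. `KNOWN_CODES` is a tiny hand-typed sample of ISO 639-3 codes used as an assumption here; a real setup would load the full code table published by the registration authority. `review_language_request` is a hypothetical helper, not Tatoeba's actual code.

```python
import re

# Tiny illustrative sample only; the real ISO 639-3 table has thousands of entries.
KNOWN_CODES = {
    "afb": "Gulf Arabic",
    "epo": "Esperanto",
    "kab": "Kabyle",
}

# ISO 639-3 identifiers are three lowercase letters.
CODE_PATTERN = re.compile(r"[a-z]{3}")


def review_language_request(code, known=KNOWN_CODES):
    """Accept a request only if the code appears in the predefined list."""
    code = code.strip().lower()
    if not CODE_PATTERN.fullmatch(code):
        return "rejected: not a three-letter code"
    if code in known:
        return f"accepted: {known[code]}"
    return "rejected: not in the predefined list, ask the registration authority"
```

The point of the design is that the project never argues about what counts as a language; it only looks the code up and defers disputes to the standard's maintainers.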

My first language is New Zealand English. It shares a writing system with most other Englishes but also has some unique spelling/sound correspondences. These are due to intermixing with NZ’s other common language, Māori.
There are many common words borrowed from Māori in New Zealand English. These words are written with the Māori spelling/sound correspondences. E.g. ‘Whakatane’ is said fah-kah-TAH-nə, not wack-a-tain. Ngaruawahia = [ŋaːɾʉaˈwaːhia], with the NG being a nasal sound as in suNG …
Macrons are sometimes also used for long vowels in Māori. Macrons are kept when writing in English.
To complicate things further… There is a wide range of real-life ways in which non-native and uneducated speakers of Māori or NZ English pronounce such words. If you showed the prompt “navigate to Whakatane” to someone with no experience of New Zealand English, it would be hard for them to know how to pronounce it.
So both ‘correct’ and incorrect pronunciations are in common use. If someone wanted to make a voice-controlled GPS app that could be used by international tourists and locals alike, then it would be important to capture all these data points.
Google and Vodafone NZ have made a presumably private dataset described here.
https://news.vodafone.co.nz/article/new-zealanders-highlight-te-reo-maori-names-be-updated-google-maps

If you are making an app for transcription of text and it is being used by someone outside of NZ, you probably don’t want Fah-Kah to correspond to the letters Whaka. So this data set needs to be separable from other Englishes.

So there is a plurality of Englishes around the world which share many but not all words. If each language has just one dataset, will the unique features of each country be left out, or all mixed together? Neither seems desirable. Or will there be a great duplication of words where there is overlap? Also not the best.
Say New Zealand English and Australian English are 95% similar in terms of words and grammar. New Zealand and British are 90% alike, and New Zealand and American are also 90% alike but in different ways. Do we need to collect completely different sentence sets for NZ, AUS, UK, US?
If not, what does an American do with the prompt “Rangitoto”, or a NZer with “Arkansas” for that matter?
When I talk with English speakers from Kenya or India, they have their own unique sets of words and grammar too. These cannot simply be accounted for as accents or informal language.
This study has some good examples of the differences:
http://archive.gameswithwords.org/WhichEnglish. … It might need the Wayback Machine to read now. “the dog was chased the cat.” Etc.
(Also, does grammar even matter to an agent trained on sound files / text chunks?)

My current second language is Japanese. I am forever embarrassed by Amazon Alexa’s refusal to understand a word I say even when humans have no problem. I wouldn’t consider myself near native, but I definitely think my accent is influenced by the region I live in. This is the language community I participate in to become a Japanese speaker, so of course I pick up its habits.

As for accents: I don’t think defining them by cities is a good idea. Firstly, because about half the world doesn’t live in one (yet). Rural people are already underserved by technology; I would hesitate to choose categories that by design make something less useful to them.
Secondly, because accent is more about language communities, maybe… There are also differences based on age, class, education level and ethnicity. Common descriptions of English accents usually include a Cultivated variant, because people like to show how educated they are by changing up their vowels.
Accent is of course related to how different speakers move from graphemes (signs) to morphemes (mental) to phonemes (sounds); this is segmental. There are also suprasegmental elements to accents: stress, intonation, prosody, pitch. These seem to be missing from this definition.
Sorry if there are mistakes here, I’m no expert. Also it’s hard to write on a phone.

https://i.imgur.com/56VgVmP.jpg Indeed. Which is Serbian, which is Croatian… I don’t want to be the one to say…