🗣 Feedback needed: Languages and accents strategy

gartenfeld · June 10, 2019, 9:40am

Hi, I’m a bit late to the party here, but I’d like to offer a linguist’s point of view.

TL;DR: we should not crowdsource these definitions. Incorporate academic resources instead.

Writing is a technology that was invented relatively recently.
Native speakers universally acquire their native language.
A natural language has an internally consistent phonology.
Spoken varieties form continuums; the division into “languages” is sometimes political or historical.
The official version of a language is often highly codified, constructed, and “unnatural” (far from spoken varieties).

Example: Norwegian is a language group; its varieties are largely mutually intelligible with each other and with Swedish. The two formalised standards are Bokmål (which, half-jokingly, is a koineized version written in Danish) and Nynorsk (an imaginary proto-version), neither of which anyone “really speaks”.

The situation with Finnish is comparable. The official standardised version is an invention that amalgamates features from natural varieties; it is highly constructed (though it can be spoken).

Example: dialect continuum.

Consider:
ENGLISH: I am the son of my father and my mother.
SCOTS: A am the son o ma faither an ma mither.
FRISIAN: Ik bin de soan fan myn heit en myn mem.
DUTCH: Ik ben de zoon van mijn vader en mijn moeder.

Consider:
The Balkan example mentioned above.

What are the recommendations then?

For high-resource languages that have standards bodies, the meta-data should designate whether the speaker is producing the standardised variety, e.g. a “native” English speaker, who can use either General American or Standard Southern British.
For regional varieties, the meta-data should designate native speakers of a variety, as defined by widely established dialectology.
Non-native speech should be labelled as such. There are varying levels of “accentedness”, from highly consistent L1-interference (in this case, you may say that the speaker has created a merged internal phonology in the process), to rampant lexical errors (e.g. using wrong tone or quantity as a result of having no control over phonemic contrast).
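A minimal sketch of how these three recommendations might look as a metadata record. The class names, field names, and validation rules are assumptions made for illustration, not an existing Common Voice schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SpeakerStatus(Enum):
    NATIVE_STANDARD = "native_standard"  # native speaker producing a standardised variety
    NATIVE_REGIONAL = "native_regional"  # native speaker of a regional variety
    NON_NATIVE = "non_native"            # non-native speech, labelled as such


@dataclass(frozen=True)
class SpeakerMetadata:
    language: str                  # e.g. "English"
    status: SpeakerStatus
    variety: Optional[str] = None  # e.g. "General American"; assumed to come from a curated list
    l1: Optional[str] = None       # speaker's first language; only relevant for non-native speech

    def __post_init__(self):
        # Assumption: non-native records should say which L1 the speaker has.
        if self.status is SpeakerStatus.NON_NATIVE and self.l1 is None:
            raise ValueError("non-native speech should record the speaker's L1")
        # Assumption: native records should always name a variety.
        if self.status is not SpeakerStatus.NON_NATIVE and self.variety is None:
            raise ValueError("native speech should name a variety")


speaker = SpeakerMetadata("English", SpeakerStatus.NATIVE_STANDARD, variety="General American")
```

The point of the enum is that the variety label comes from an established list rather than free-text self-description.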

Now in terms of ASR, conventionally there are two models: the acoustic model and the language model. At some point it may be helpful to also have a separate phonology model: e.g. which phonemes can occur together, how they change into allophones in different contexts, or in the case of non-native phonology, substitutions etc.
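To make the idea of a separate phonology model concrete, here is a toy sketch of its three parts: phonotactic constraints, context-dependent allophones, and non-native substitutions. The rule tables are simplified, English-like assumptions chosen for illustration; a real model would be derived from dialectological data.

```python
# Toy phonology model. All rule tables below are illustrative assumptions.

# Phonotactics: onsets that may not start a word (English-like simplification).
ILLEGAL_ONSETS = {("ŋ",), ("t", "l")}

# Non-native phonology: a hypothetical L1-driven substitution table.
L1_SUBSTITUTIONS = {"θ": "s", "ð": "z"}


def is_legal_onset(phonemes):
    """Return True if this phoneme sequence is allowed as a word onset."""
    return tuple(phonemes) not in ILLEGAL_ONSETS


def apply_allophones(phonemes):
    """Aspirate a word-initial voiceless stop: /p t k/ -> [pʰ tʰ kʰ]."""
    out = list(phonemes)
    if out and out[0] in {"p", "t", "k"}:
        out[0] = out[0] + "ʰ"
    return out


def apply_substitutions(phonemes, table=L1_SUBSTITUTIONS):
    """Replace phonemes missing from the speaker's L1 with their usual substitutes."""
    return [table.get(p, p) for p in phonemes]


print(is_legal_onset(["t", "l"]))
print(apply_allophones(["p", "ɪ", "n"]))
print(apply_substitutions(["θ", "ɪ", "ŋ", "k"]))
```

Keeping these rules in their own component would let the same acoustic model be paired with different phonologies per variety or per L1 background.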

gartenfeld · June 10, 2019, 10:14am

In practical terms, what crowdsourced questions would be useful for describing the speech production itself? I imagine that, independent of the language/variety designation, we can get meaningful self-reported information along several axes:

Stable-Unstable
“When you speak this language, how stable is your accent over time?”
This ranges from a native speaker of a variety to, say, cosmopolitan Finns who speak convincing TV English but whose accent varies widely from week to week.
Cf. https://www.phonetik.uni-muenchen.de/~jmh/research/papers/harrington00.nature.pdf

Convincing-Accented
“How do others (especially native speakers) perceive your accent?”
This only applies when you are aiming for an idealised target, e.g. an actor in a film playing a speaker of some other variety. Note: here “accented” may be a bit misleading, since “convincing” is relative to the selected target. For example, when there is a consistent L1-mediated non-native phonology, the actor can put on a convincing “Russian accent” in English.

Regional-Koineized
“Are you speaking a variety that is used when regional locals talk to each other?”
When Glaswegians talk to Glaswegians, the production may be different from when they talk to New Zealanders.
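As a rough sketch, the three axes above could be collected as one small self-report record per speaker. The 1 to 5 scale, the field names, and the validation are assumptions for illustration; only the question wording comes from the list above.

```python
from dataclasses import dataclass

# Question text for each axis, as worded above.
QUESTIONS = {
    "stability": "When you speak this language, how stable is your accent over time?",
    "convincingness": "How do others (especially native speakers) perceive your accent?",
    "regionality": "Are you speaking a variety that is used when regional locals talk to each other?",
}


@dataclass(frozen=True)
class AccentSelfReport:
    # Assumed scale: integers from 1 to 5 on every axis.
    stability: int       # 1 = unstable ... 5 = stable
    convincingness: int  # 1 = accented ... 5 = convincing (relative to the chosen target)
    regionality: int     # 1 = koineized ... 5 = regional

    def __post_init__(self):
        for axis in QUESTIONS:
            value = getattr(self, axis)
            if not 1 <= value <= 5:
                raise ValueError(f"{axis} must be between 1 and 5, got {value}")


report = AccentSelfReport(stability=2, convincingness=4, regionality=1)
```

Because the axes are independent of the language/variety label, a record like this can be stored alongside whatever variety designation is eventually chosen.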

ftyers · June 20, 2019, 7:00pm

Do you have any empirical evidence that people are not able to self-classify their accents and that the subsequent classifications are not useful for the task of producing targeted speech recognition?

ftyers · June 20, 2019, 7:02pm

This is more or less the right approach.

jsalsman · July 9, 2019, 9:53pm

@ftyers sometimes. Ferragne_2010_JPho.pdf has a British Isles English accent map, sort of. (1.3 MB)

nukeador · July 12, 2019, 8:26am

Sorry everyone for taking so long to provide an update here.

We have been analyzing all your feedback, as well as consulting with more linguistic experts, both online and in person.

We are currently getting agreement on the final proposal, which I’ll share here as soon as it’s ready. It’s currently leaning towards a less restrictive approach (as a lot of you asked for).

Thanks for your patience.

nukeador · October 1, 2019, 11:56am

October update:

We are still waiting to have a few more conversations with linguists (sorry, this is taking way longer than we expected), and we have also been trying to balance the current proposal so that we bring value to both product and linguistic researchers using our dataset.

A lot of this project has been learning as we go, and the complexity of providing value to everyone is higher than we initially thought.

My plan is to be able to come back with a recommendation by the end of this month.

Thanks all for your understanding!

belkacem77 · November 16, 2019, 10:04pm

Hi @txopi
It’s the same with the Kabyle language.

When I met some members from the Garabide foundation we talked about the different dialects and accents. They don’t correspond to the administrative divisions. It’s the same with Kabylia.

belkacem77 · November 16, 2019, 10:06pm

Thanks for the lesson!!
Nice to know that

nukeador · November 18, 2019, 12:31pm

Hey everyone,

I know we haven’t published this yet, but getting to an agreement is taking a lot of time since I’m scheduling personal conversations with different stakeholders.

My goal is to get the green light from the Deep Speech and legal teams this week so we can share it here.

Cheers.

nukeador · November 22, 2019, 5:35pm

Update: As we have expressed here, we are definitely considering a more location-oriented metadata strategy to understand how people are likely to sound (as an improved version of the May proposal posted here).

We are right now evaluating with our legal team the requirements and limitations. We’ll share it as soon as we have agreement.

nukeador · December 18, 2019, 12:22pm

December update:

We have been working with our legal team on this proposal over the past month. Bearing in mind that the current development focus is on infrastructure, and that we won’t have the bandwidth to implement any changes on the site as a result of this proposal until at least February 2020, we agreed to give the legal team more time to finish the review and come back with recommendations.

Ideally this will happen mid/late-January so we have clarity to start planning on implementation by February.

I’ll keep you posted and share the final proposal after the legal recommendations are out.

Thanks for your patience!

nukeador · March 25, 2020, 4:37pm

The latest version of the strategy (v5) is now published here
