Labelled data of native and non-native speakers

Is data from non-native speakers of German available? If yes, is there a way to identify the native language of a user speaking German?


Hey @Lakshmi, welcome.

There is no L1/L2 distinction in the dataset. There is an “accent(s)” field, which people can fill in voluntarily, but only if they register and record while logged in. Because it is a free-text field, the entries need to be cleaned up and re-categorized before use. Even then, I’m afraid this will not be enough.
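As a rough illustration of re-categorizing the free-form accent entries, here is a minimal Python sketch. It assumes each clip carries one raw string that may hold several labels separated by commas or pipes; that separator, the function names, and the keyword map are all assumptions made for illustration, not something taken from the dataset documentation. Labels that match no keyword are kept as "uncategorized" rather than guessed.

```python
import re

# Hypothetical keyword map: substring of a free-form label -> coarse category.
# Illustrative only; a real map has to be built by inspecting the labels
# that actually occur in the data.
KEYWORD_TO_CATEGORY = {
    "österreich": "austria",
    "austria": "austria",
    "schweiz": "switzerland",
    "swiss": "switzerland",
    "deutschland": "germany",
    "germany": "germany",
}


def split_labels(raw):
    """Split a raw accent field into trimmed, lower-cased labels."""
    if not raw:
        return []
    parts = re.split(r"[|,]", raw)  # assumed separators
    return [p.strip().lower() for p in parts if p.strip()]


def categorize(label):
    """Map one lower-cased label to a category via substring match."""
    for keyword, category in KEYWORD_TO_CATEGORY.items():
        if keyword in label:
            return category
    return "uncategorized"


def categorize_field(raw):
    """Return the sorted set of categories found in one raw field."""
    return sorted({categorize(label) for label in split_labels(raw)})
```

Substring matching like this is crude, so the "uncategorized" bucket still needs manual review. It also only groups what speakers chose to report about themselves; it says nothing about whether someone is an L1 or L2 speaker.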

To accomplish what you want, you would need to label each speaker in a separate project by listening to samples from each one. But, IMO, even then it would not be scientifically sound.

There is also the “variant” field for some languages (incl. geographical dialects), but as it was introduced later, most of the related data is in the accents field. German does not have variants, but it has the following “accents” presets, in addition to free-form ones:

268	de	preset	germany	Deutschland Deutsch
269	de	preset	netherlands	Niederländisch Deutsch
270	de	preset	austria	Österreichisches Deutsch
271	de	preset	poland	Polnisch Deutsch
272	de	preset	switzerland	Schweizerdeutsch
273	de	preset	united_kingdom	Britisches Deutsch
274	de	preset	france	Französisch Deutsch
275	de	preset	denmark	Dänisch Deutsch
276	de	preset	belgium	Belgisches Deutsch
277	de	preset	hungary	Ungarisch Deutsch
278	de	preset	brazil	Brasilianisches Deutsch
279	de	preset	czechia	Tschechisch Deutsch
280	de	preset	united_states	Amerikanisches Deutsch
281	de	preset	slovakia	Slowakisch Deutsch
282	de	preset	russia	Russisch Deutsch
283	de	preset	kazakhstan	Kasachisch Deutsch
284	de	preset	italy	Italienisch Deutsch
285	de	preset	finland	Finnisch Deutsch
286	de	preset	slovenia	Slowenisch Deutsch
287	de	preset	canada	Kanadisches Deutsch
288	de	preset	bulgaria	Bulgarisch Deutsch
289	de	preset	greece	Griechisch Deutsch
290	de	preset	lithuania	Litauisch Deutsch
291	de	preset	luxembourg	Luxemburgisches Deutsch
292	de	preset	paraguay	Paraguayisch Deutsch
293	de	preset	romania	Rumänisch Deutsch
294	de	preset	liechtenstein	liechtensteinisches Deutscher
295	de	preset	namibia	Namibisch Deutsch
296	de	preset	turkey	Türkisch Deutsch
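If you want to work with the preset list programmatically, a small sketch like the following could load it into a lookup. It assumes the listing is tab-separated with five columns in the order shown above (numeric id, locale, kind, key, display label); the column meanings and the `load_presets` name are my assumptions, and only three rows are embedded here as sample input.

```python
import csv
import io

# Three rows copied from the listing above, tab-separated.
SAMPLE = (
    "268\tde\tpreset\tgermany\tDeutschland Deutsch\n"
    "270\tde\tpreset\taustria\tÖsterreichisches Deutsch\n"
    "272\tde\tpreset\tswitzerland\tSchweizerdeutsch\n"
)


def load_presets(text, locale="de"):
    """Parse the tab-separated listing into {key: display label}."""
    presets = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) != 5:
            continue  # skip blank or malformed lines
        _id, row_locale, kind, key, label = row
        if row_locale == locale and kind == "preset":
            presets[key] = label
    return presets


PRESETS = load_presets(SAMPLE)
```

With the full listing pasted in, `PRESETS` gives you the fixed labels to match against before falling back to handling free-form entries.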

I think @ftyers can give you more information on the reasons behind the dataset decisions and possible solutions.

We should also consider the ethical implications here. Programmatically distinguishing between native and non-native speakers on the basis of accent may allow the automation of existing accent biases, which are well documented in the literature.

Many non-native speakers are already marginalised in society because of the way they speak, and we need to be careful not to accelerate marginalisation simply because we have the data to be able to do so by using Common Voice accent data.

One area in which this is happening is immigration control, for example:

@kathyreid, never thought about that - and the example is for Turkish :frowning:

Somehow my thinking had drifted towards language-learning systems, where recordings from L1 speakers go into training, so that with live data an L2 speaker's input could be flagged as an error because of mispronunciation, etc.

