Hello all,
Is data from non-native speakers of German available? If so, is there a way to identify the native language of a user speaking German?
Regards,
Lakshmi
Hey @Lakshmi, welcome.
There is no L1/L2 distinction in the dataset. There is, however, an “accent(s)” field that people can fill in voluntarily. To do so, they must first register and record while logged in. As it is a free-text field, the values need to be re-categorized before use. But I’m afraid this will not be enough.
To accomplish what you want, you would need to label each speaker in a separate project by listening to samples from each of them. But, IMO, even then it would not be scientifically sound.
There is also the “variant” field for some languages (including geographical dialects), but as it was introduced later, most of the related data is in accents. German (Deutsch) does not have variants, but it has the following “accents” presets, in addition to free-form ones:
268 de preset germany Deutschland Deutsch
269 de preset netherlands Niederländisch Deutsch
270 de preset austria Österreichisches Deutsch
271 de preset poland Polnisch Deutsch
272 de preset switzerland Schweizerdeutsch
273 de preset united_kingdom Britisches Deutsch
274 de preset france Französisch Deutsch
275 de preset denmark Dänisch Deutsch
276 de preset belgium Belgisches Deutsch
277 de preset hungary Ungarisch Deutsch
278 de preset brazil Brasilianisches Deutsch
279 de preset czechia Tschechisch Deutsch
280 de preset united_states Amerikanisches Deutsch
281 de preset slovakia Slowakisch Deutsch
282 de preset russia Russisch Deutsch
283 de preset kazakhstan Kasachisch Deutsch
284 de preset italy Italienisch Deutsch
285 de preset finland Finnisch Deutsch
286 de preset slovenia Slowenisch Deutsch
287 de preset canada Kanadisches Deutsch
288 de preset bulgaria Bulgarisch Deutsch
289 de preset greece Griechisch Deutsch
290 de preset lithuania Litauisch Deutsch
291 de preset luxembourg Luxemburgisches Deutsch
292 de preset paraguay Paraguayisch Deutsch
293 de preset romania Rumänisch Deutsch
294 de preset liechtenstein liechtensteinisches Deutscher
295 de preset namibia Namibisch Deutsch
296 de preset turkey Türkisch Deutsch
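As a rough sketch of the re-categorization step mentioned above, the snippet below counts speakers per accent category from a TSV export. It rests on several assumptions: that the column is named `accents`, that multiple labels in one cell are comma-separated, and that `client_id` identifies a speaker. The `PRESET_LABELS` mapping, the helper names, and the sample rows are all hypothetical and only cover a few of the presets listed above. Note that a self-reported accent label is still not a reliable indicator of native language.

```python
import csv
import io
from collections import Counter

# Hypothetical mapping from a few preset labels to coarse categories.
# Extend it with the remaining presets and any free-form labels you find.
PRESET_LABELS = {
    "deutschland deutsch": "germany",
    "österreichisches deutsch": "austria",
    "schweizerdeutsch": "switzerland",
    "türkisch deutsch": "turkey",
}


def normalize_accents(raw):
    """Split a raw accents cell (assumed comma-separated) into categories."""
    labels = [part.strip().lower() for part in raw.split(",") if part.strip()]
    # Unknown labels are kept under an "other:" prefix for manual review.
    return [PRESET_LABELS.get(label, "other:" + label) for label in labels]


def count_accent_categories(tsv_text):
    """Count distinct speakers (client_id) per accent category."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen = set()
    counts = Counter()
    for row in reader:
        for category in normalize_accents(row.get("accents") or ""):
            key = (row["client_id"], category)
            if key not in seen:  # count each speaker once per category
                seen.add(key)
                counts[category] += 1
    return counts


# Made-up sample rows, only to show the shape of the input.
SAMPLE = (
    "client_id\tpath\taccents\n"
    "a1\tclip1.mp3\tDeutschland Deutsch\n"
    "a1\tclip2.mp3\tDeutschland Deutsch\n"
    "b2\tclip3.mp3\tTürkisch Deutsch\n"
    "c3\tclip4.mp3\t\n"
    "d4\tclip5.mp3\tÖsterreichisches Deutsch, Wienerisch\n"
)

if __name__ == "__main__":
    for category, n in sorted(count_accent_categories(SAMPLE).items()):
        print(category, n)
```

Speakers who left the field empty (like `c3` in the sample) simply do not show up in the counts, which is one reason this will not be enough on its own.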
I think @ftyers can give you more information on the reasons behind the dataset decisions and possible solutions.
We should also consider the ethical implications here. Programmatically distinguishing between native and non-native speakers on the basis of accent may allow the automation of existing accent biases, which are well documented in the literature.
Many non-native speakers are already marginalised in society because of the way they speak, and we need to be careful not to accelerate that marginalisation simply because Common Voice accent data makes it possible.
One area in which this is happening is immigration control, for example:
Korkmaz, Yunus, and Aytuğ Boyacı. “A comprehensive Turkish accent/dialect recognition system using acoustic perceptual formants.” Applied Acoustics 193 (2022): 108761.
@kathyreid, I had never thought about that, and the example is for Turkish.
Somehow my thinking had shifted towards language learning systems, e.g. L1 speakers go into training… so on live data, L2 speakers could error out because of bad pronunciation, etc.