Native language in dataset

cjbaker · June 9, 2020, 4:54pm

I notice that my profile on the site has “Native language” and “Additional languages” fields, but I can’t seem to find this information in the datasets. The dataset .tsv files just have “accent”. Am I missing something, or would it be possible to include this data in a future release?

It would be useful, for example, to be able to download the French dataset and determine which speakers are non-native and what their native languages are. This information would be critical for automatic accent identification, and might also be useful for speech model adaptation and for testing robustness across accents. I see that there are plans to add a “native” field in the languages and accents strategy, but if the information is already in the database, it would be useful to have it in the released data.

Christos · June 18, 2020, 12:44pm

Hi Craig, and thanks for your message.
That’s a great topic, and I am going to bring it up for discussion at our upcoming Product meeting.

I will get back to you as soon as I have an update.

cjbaker · July 1, 2020, 3:00am

Hi Christos, I’ve just downloaded the new dataset (for French) and it doesn’t seem to contain a native language field. There is a ‘locale’ field (I think this is new?) but the only value is ‘fr’ for all the clips. Is there any chance I might be able to get this information for the client_ids in the existing dataset? Thank you -Craig
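
For anyone who wants to check their own download, here is a minimal Python sketch of how one might list the columns in a dataset .tsv file and count the distinct values in the ‘accent’ and ‘locale’ columns. The sample rows are made up for illustration, and the ‘path’ and ‘sentence’ column names are assumptions; only ‘client_id’, ‘accent’ and ‘locale’ come from this thread.

```python
import csv
import io
from collections import Counter


def summarize_tsv(handle, columns=("accent", "locale")):
    """Return the header and per-column value counts for a tab-separated file."""
    # QUOTE_NONE so that quotation marks inside sentences are read literally.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = list(reader.fieldnames or [])
    # Only count columns that actually exist in this file.
    counts = {name: Counter() for name in columns if name in header}
    for row in reader:
        for name, counter in counts.items():
            # Missing or empty cells are counted under the empty string.
            counter[row[name] or ""] += 1
    return header, counts


# Hypothetical sample data, not taken from the real dataset.
SAMPLE = (
    "client_id\tpath\tsentence\taccent\tlocale\n"
    "abc\tclip1.mp3\tBonjour.\tfrance\tfr\n"
    "abc\tclip2.mp3\tMerci.\tfrance\tfr\n"
    "def\tclip3.mp3\tSalut.\t\tfr\n"
)

header, counts = summarize_tsv(io.StringIO(SAMPLE))
print("columns:", header)
for name, counter in counts.items():
    print(name, dict(counter))
```

To run this against a real file, pass an open file handle (opened with `encoding="utf-8"` and `newline=""`) instead of the `io.StringIO` sample. If a native-language column were present, it would show up in the printed header.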
