Native language in dataset

I notice that my profile on the site has “Native language” and “Additional languages” fields, but I can’t seem to find this information in the datasets. The dataset .tsv files just have “accent”. Am I missing something, or would it be possible to include this data in a future release?

It would be useful, for example, to be able to download the French dataset and determine which speakers are non-native and what their native language is. This information would be critical for automatic accent identification, and might also be useful for speech model adaptation and for testing robustness to different accents. I see that the languages and accents strategy includes plans to add a “native” field, but if the information is already in the database, it would be useful to have it in the datasets.

Hi Craig and thanks for your message.
That’s a great topic that I am going to bring up at our upcoming Product meeting.

I will get back to you as soon as I have an update.

Hi Christos, I’ve just downloaded the new dataset (for French) and it doesn’t seem to contain a native language field. There is a ‘locale’ field (I think this is new?) but the only value is ‘fr’ for all the clips. Is there any chance I might be able to get this information for the client_ids in the existing dataset? Thank you -Craig
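
For anyone who wants to repeat this check, here is a minimal sketch of how the columns of a dataset .tsv file could be inspected. It uses only the Python standard library. The sample rows are made up, the `path` and `sentence` columns are assumptions about the file layout, and `native_language` is a hypothetical column name used only to show the lookup; `client_id`, `accent`, and `locale` are the fields mentioned above.

```python
import csv
import io
from collections import Counter


def summarize_columns(tsv_file, columns=("locale", "accent")):
    """Return the header row and per-column value counts of a tab-separated file."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = {name: Counter() for name in columns}
    for row in reader:
        for name in columns:
            if name in row:
                # Empty or missing cells are counted under the empty string.
                counts[name][row[name] or ""] += 1
    return list(reader.fieldnames or []), counts


# Made-up rows standing in for a real dataset file (layout is an assumption).
sample = io.StringIO(
    "client_id\tpath\tsentence\taccent\tlocale\n"
    "aaa\tclip1.mp3\tBonjour\tfrance\tfr\n"
    "bbb\tclip2.mp3\tMerci\t\tfr\n"
)

header, counts = summarize_columns(sample)
print("columns:", header)
print("has native_language column:", "native_language" in header)  # hypothetical name
print("locale values:", dict(counts["locale"]))
print("accent values:", dict(counts["accent"]))
```

To run it on a downloaded file instead of the sample, pass a file object opened with `open(path, newline="", encoding="utf-8")`, where `path` points at one of the .tsv files in the release.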