Missing locale info in the tsv files

Hi all! :slight_smile:

First of all, thank you very much for your wonderful work. I’ve been using the CV datasets for years, and they’ve been really helpful for many languages and tasks.

Perhaps someone has already answered this question, but I couldn’t find the answer. I’ve noticed that in the 12.0 release the “locale” information is missing from the tsv files. Is this intentional, or just a mistake? Being able to tell apart the origin of the speakers was very useful for me.

Thank you very much in advance, and congratulations on the good work again!

Fran.

If you mean the “accent” data, here is the related bug report:

Otherwise, the “locale” field does exist in the tsv files of several languages I just checked.
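
For anyone who wants to run the same check on their own download, here is a minimal sketch, not an official tool. It only reads the header row of a tsv file and reports which expected columns are absent. The column names `locale` and `accent` are taken from this thread and are an assumption; the header in your release may spell them differently (for instance `accents`), so adjust the tuple accordingly. The helper name `missing_columns` is made up for illustration.

```python
import csv

# Column names assumed from this thread; verify them against your release.
EXPECTED_COLUMNS = ("locale", "accent")


def missing_columns(lines, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from a TSV header.

    `lines` is any iterable of text lines, such as an open file object.
    """
    reader = csv.reader(lines, delimiter="\t")
    # Empty input yields an empty header, so every column is reported missing.
    header = next(reader, [])
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]
```

To use it on a real file, open the file in text mode and pass the file object, e.g. `with open(path, newline="", encoding="utf-8") as f: print(missing_columns(f))`, where `path` points at one of the tsv files in your copy of the dataset. An empty list means all the expected columns are present in the header.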

Hi Bülent! :slight_smile:

Thank you very much for the quick response, and for pointing me to the right place. I did indeed mean the “accent” data. My bad.

Have a nice day,

Fran.

Thanks @bozden for sharing the GitHub Issue link. @jesslynnrose is very kindly looking into this for me as well, and @minstrangeland, it’s so good to know others are interested in this data too :slight_smile:

@minstrangeland I have a GitHub repo you may be interested in: a Jupyter notebook of heuristics for working with English accent data that helps to group and relate the accents. It may be useful for your work, too.

@kathyreid Thanks for the link. I’ll take a look at it :slight_smile:

Hello! Sorry about the late reply.

This is indeed a bug affecting multiple languages, and our engineers are working to have it fixed in upcoming releases.

My apologies for the inconvenience, and I massively appreciate you raising the issue and letting us know!

Hi @jesslynnrose! :slight_smile:

Thanks for letting me know, and no need to apologize for the delay. You were pretty fast.

Have a nice day!

Fran.