Hi all!
First of all, thank you very much for your wonderful work. I’ve been using the CV datasets for years, and they’ve been really helpful for many languages and tasks.
Perhaps someone has already answered this question, but I couldn’t find the answer. I’ve seen that in the 12.0 release the “locale” information is missing from the tsv files. Is this intentional, or just a mistake? For me it was very useful to be able to differentiate the origin of the speakers.
Thank you very much in advance, and congratulations on the good work again!
Fran.
bozden
(Bülent Özden)
February 21, 2023, 6:17pm
If you mean the “accent” data, here is the related bug report:
opened 04:45AM - 05 Feb 23 UTC
closed 11:05AM - 16 Mar 23 UTC
Bug
**Describe the bug**
In the v12 release of the Common Voice dataset in English … (`en`) - I haven't checked other languages - there is no **accent** data in the `validated.tsv` file, or in any of the other split files (such as `train.tsv`). I am using the accent data for research purposes.
If I compare to the v11 release, the accent data for `en` is provided in a field of the `validated.tsv` file in comma-separated form, e.g.:
```
'United States English, Midwestern United States English'
```
None of this data is provided in the v12 dataset release.
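For anyone working with the v11 field, here is a minimal sketch of splitting the comma-separated form shown above into individual labels. The helper name `split_accents` and the sample list are made up for illustration, and it assumes no single accent label contains a comma itself, which may not hold for free-text entries.

```python
from collections import Counter


def split_accents(raw):
    """Split one comma-separated accent field into a list of clean labels."""
    # Missing values (None, or the float NaN that pandas uses) yield no labels.
    if not isinstance(raw, str):
        return []
    return [label.strip() for label in raw.split(",") if label.strip()]


# Hypothetical field values for illustration; only the first string
# is taken from the v11 example above.
sample_fields = [
    "United States English, Midwestern United States English",
    "United States English",
    "",
]

# Tally how often each individual label occurs across the sample rows.
label_counts = Counter(
    label for field in sample_fields for label in split_accents(field)
)
print(label_counts.most_common())
```

On a real dataframe, the same helper could be applied per row, for example with `df['accents'].map(split_accents)`.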
I identified this by working in `pandas` - my working Jupyter notebook is at:
[https://github.com/KathyReid/cvaccents](https://github.com/KathyReid/cvaccents)
There is a 0 row count for "rows with accents" in v12, compared with 861134 in v11.
The v12 `validated.tsv` file in a `pandas` dataframe:
```
# rows that have accent metadata
len(df[df['accents'].notna()])
0
```
The same check on the v11 `validated.tsv` file:
```
# rows that have accent metadata
len(df[df['accents'].notna()])
861134
```
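To repeat the check above on other languages or releases, something like the following sketch may help. It is not taken from the linked notebook: the function name `count_rows_with_accents` is hypothetical, the two tiny TSV samples are invented stand-ins for real `validated.tsv` files, and treating empty or whitespace-only cells as "no accent" is an assumption. Quoting is disabled on the assumption that quote characters in the sentence column should be read literally.

```python
import csv
import io

import pandas as pd


def count_rows_with_accents(tsv_source, column="accents"):
    """Count rows whose accent column holds a non-empty value."""
    df = pd.read_csv(
        tsv_source,
        sep="\t",
        dtype=str,
        keep_default_na=False,   # keep empty cells as "" instead of NaN
        quoting=csv.QUOTE_NONE,  # treat quote characters as plain text
    )
    if column not in df.columns:
        return 0
    return int(df[column].str.strip().ne("").sum())


# Invented miniature stand-ins for the v11 and v12 files (not real data).
V11_SAMPLE = (
    "client_id\tpath\tsentence\taccents\tlocale\n"
    "a1\tclip1.mp3\tHello there\tUnited States English, Midwestern United States English\ten\n"
    "b2\tclip2.mp3\tGood morning\t\ten\n"
)
V12_SAMPLE = (
    "client_id\tpath\tsentence\taccents\tlocale\n"
    "a1\tclip1.mp3\tHello there\t\ten\n"
    "b2\tclip2.mp3\tGood morning\t\ten\n"
)

for name, sample in [("v11", V11_SAMPLE), ("v12", V12_SAMPLE)]:
    print(name, count_rows_with_accents(io.StringIO(sample)))
```

With real data, a path to the release's `validated.tsv` can be passed in place of the `io.StringIO` object.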
**To Reproduce**
As described above, please compare `validated.tsv` for v11 versus v12 of the `en` corpus.
**Expected behavior**
Accent data to be included in `validated.tsv`, please 😄 🙏🏾
**Screenshots**
Not needed here, but can provide if needed.
**Desktop or Mobile (please complete the following information):**
Not applicable to this bug.
**Additional Hardware (were you using headphones, an external speaker or an external microphone?):**
Not applicable
**Additional context**
I am researching voice accent data, and want to replicate my workflow on v12 of the data (I have a workflow for examining v11 of the `en` dataset).
Otherwise, the “locale” field does exist in the tsv files of several languages I just checked.
Hi Bülent!
Thank you very much for the quick response, and for pointing me to the right place. I did indeed mean the “accent” data. My bad.
Have a nice day,
Fran.
kathyreid
(Kathy Reid)
February 22, 2023, 10:08pm
Thanks @bozden for sharing the GitHub Issue link - @jesslynnrose is very kindly looking into this for me as well - and @minstrangeland it’s so good to know others are interested in this data too.
@minstrangeland I have a GitHub repo you may be interested in - it’s a Jupyter notebook of heuristics for working with English accent data - it helps to group and relate the accents. It may be useful for your work, too.
@kathyreid, thanks for the link. I’ll take a look at it.
Hello! Sorry about the late reply.
This is indeed a bug affecting multiple languages, and our engineers are working to get it fixed in upcoming releases.
My apologies for the inconvenience, but I massively appreciate you raising the issue and letting us know!
Hi @jesslynnrose !
Thanks for letting me know, and no need to apologize for the delay. You were pretty fast.
Have a nice day!
Fran.