Speaker ID split between train/test/dev

ElChocolatero · December 14, 2018, 4:23pm

Hello,

there was some discussion earlier this year about whether the same speakers appear in more than one of the training, dev, and test splits. It wasn’t clear whether this was the case. Could you please confirm what the situation is if I download the dataset in its current form?

sdenton4 · February 12, 2019, 4:53pm

Hi, I would also like an update on this. The metadata in the current dataset download has last-modified timestamps of Nov 2017, which makes me think that the train/test split is not resolved in the public dataset. I did see some changes in the GitHub repository in March that looked like they were meant to resolve the issue.

Aside from publishing an entirely new set, it would be nice to just re-bucket the train/test split in the currently available version. I can’t do this myself, since the speaker IDs don’t seem to be available in the metadata. (I’m personally mostly concerned about having the same speakers in train and test.)

gregor · February 12, 2019, 5:15pm

Hi,

there’s a beta release for the new dataset in this thread:

The related metadata file can be split up with the linked CorporaCreator tool to make sure that no speaker overlap exists in the train/dev/test sets.

If you don’t want to do that manual work, we’ll also do a full release in less than a month.
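
For anyone who wants to re-bucket by hand, here is a minimal sketch of a speaker-disjoint split. It is not the CorporaCreator implementation. The `client_id` column name, the 80/10/10 fractions, and the `split_by_speaker` helper are all assumptions made for illustration, and it only works if the metadata actually carries some per-speaker identifier.

```python
import random
from collections import defaultdict


def split_by_speaker(rows, speaker_key="client_id",
                     fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign every clip of a given speaker to exactly one split.

    rows: list of dicts, one per clip (hypothetical metadata format).
    speaker_key: name of the speaker ID field (an assumption).
    fractions: share of *speakers* (not clips) for train and dev;
               test receives all remaining speakers.
    """
    # Group clips by speaker so a speaker can never straddle two splits.
    by_speaker = defaultdict(list)
    for row in rows:
        by_speaker[row[speaker_key]].append(row)

    # Sort first so the shuffle is reproducible for a given seed.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_train = int(len(speakers) * fractions[0])
    n_dev = int(len(speakers) * fractions[1])
    groups = {
        "train": speakers[:n_train],
        "dev": speakers[n_train:n_train + n_dev],
        "test": speakers[n_train + n_dev:],
    }
    return {
        name: [row for speaker in ids for row in by_speaker[speaker]]
        for name, ids in groups.items()
    }
```

Because the split is made over speakers rather than clips, the clip counts per set will only roughly follow the requested fractions when speakers contribute very different numbers of recordings.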

sdenton4 · February 12, 2019, 5:36pm

That’s great, thanks!

ElChocolatero · February 15, 2019, 12:10pm

I filled out the form for the beta dataset a few days ago but haven’t received any email. Is there any other way to get hold of it? Thanks
