Speaker ID split between train/test/dev



There was some discussion earlier this year about whether the same speakers appear in more than one of the train, dev, and test splits, and it wasn’t clear what the answer was. Could you please confirm what the situation is if I download the dataset in its current form?

(Sdenton4) #2

Hi, I would also like an update on this. The metadata in the current dataset download has last-modified timestamps of Nov 2017, which makes me think that the train/test split has not been fixed in the public dataset. I did see some changes in the GitHub repository in March that looked like they were meant to resolve the issue.

Aside from publishing an entirely new set, it would be nice to just re-bucket the train/test split in the currently available version. I can’t do this myself, since the speaker IDs don’t seem to be available in the metadata. (I’m personally mostly concerned about having the same speakers in train and test.)

(Gregor) #3


There’s a beta release for the new dataset in this thread:

The related metadata file can be split up with the linked CorporaCreator tool to make sure that no speaker overlap exists in the train/dev/test sets.

If you don’t want to do that manual work, we’ll also do a full release in less than a month.
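For anyone who wants to see what a speaker-disjoint split looks like in principle, here is a minimal sketch. It is not how CorporaCreator works internally. It assumes the metadata has been loaded as a list of dicts with a per-speaker column, here hypothetically called `client_id`, and the 10% dev / 10% test proportions are also just placeholders.

```python
import hashlib
from collections import defaultdict


def assign_split(speaker_id, dev_pct=10, test_pct=10):
    """Map a speaker ID to 'train', 'dev', or 'test' deterministically.

    Hashing the ID means every clip from one speaker lands in the same split.
    The percentages are illustrative, not taken from the dataset.
    """
    digest = hashlib.sha256(speaker_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


def split_by_speaker(rows, speaker_key="client_id"):
    """Group metadata rows into splits; 'client_id' is an assumed column name."""
    splits = defaultdict(list)
    for row in rows:
        splits[assign_split(row[speaker_key])].append(row)
    return splits


def speakers_overlap(splits, speaker_key="client_id"):
    """Return True if any speaker ID appears in more than one split."""
    seen = {}
    for name, rows in splits.items():
        for row in rows:
            owner = seen.setdefault(row[speaker_key], name)
            if owner != name:
                return True
    return False
```

The rows could come from `csv.DictReader` over the metadata file. Because the split is decided per speaker rather than per clip, the resulting clip counts will only roughly match the requested percentages.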

(Sdenton4) #4

That’s great, thanks!


I filled out the form for the beta dataset a few days ago but haven’t received any email. Is there any other way to get hold of it? Thanks