There was some discussion earlier this year about whether the same speakers appear in more than one of the training, dev, and test splits. It wasn’t clear whether this was the case. Could you please confirm what the situation is if I download the dataset in its current form?
Hi, I would also like an update on this. The metadata in the current dataset download has last-modified timestamps of Nov 2017, which makes me think that the train/test split issue has not been resolved in the public dataset. I did see some changes on GitHub in March that looked like they were meant to resolve the issue.
Aside from publishing an entirely new set, it would be nice to just re-bucket the train/test split in the currently available version. I can’t do this myself, since the speaker IDs don’t seem to be available in the metadata. (Personally, I’m mostly concerned about having the same speakers in both train and test.)
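For what it's worth, if speaker IDs were exposed in the metadata, a speaker-disjoint re-bucketing could look roughly like the sketch below. Everything named here is an assumption for illustration: the `(clip_path, speaker_id)` pairs, the function names, and the 10% dev / 10% test proportions are hypothetical and not taken from the dataset or its tooling. The idea is to hash each speaker ID to a stable slot so that all clips from one speaker land in the same split.

```python
import hashlib
from collections import defaultdict


def bucket_for_speaker(speaker_id, dev_pct=10, test_pct=10):
    """Map a speaker ID to 'train', 'dev', or 'test' deterministically.

    The percentages are illustrative defaults, not the dataset's real ratios.
    """
    digest = hashlib.sha256(speaker_id.encode("utf-8")).hexdigest()
    slot = int(digest, 16) % 100  # stable value in 0..99 for this speaker
    if slot < test_pct:
        return "test"
    if slot < test_pct + dev_pct:
        return "dev"
    return "train"


def split_by_speaker(clips):
    """Group clips into splits so that no speaker spans two splits.

    `clips` is an iterable of (clip_path, speaker_id) pairs; this layout
    is hypothetical, since the public metadata may not include speaker IDs.
    """
    splits = defaultdict(list)
    for clip_path, speaker_id in clips:
        splits[bucket_for_speaker(speaker_id)].append(clip_path)
    return dict(splits)


if __name__ == "__main__":
    # Made-up example data: 70 clips from 7 speakers.
    example = [(f"clip_{i}.mp3", f"spk_{i % 7}") for i in range(70)]
    for name, paths in sorted(split_by_speaker(example).items()):
        print(name, len(paths))
```

Because the assignment depends only on the speaker ID, re-running it on a later dataset release would keep existing speakers in the same split. The split sizes are only approximately proportional, since whole speakers are moved rather than individual clips.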