Hello, there was some discussion earlier this year about whether the same speakers appear in more than one of the train, dev, and test splits. It wasn’t clear whether this was the case. Could you please confirm what the situation is if I download the dataset …

Hi, I would also like an update on this. The metadata in the current dataset download has last-modified timestamps of Nov 2017, which makes me think that the train/test split issue is not resolved in the public dataset. I did see some changes on GitHub in March that looked like they were meant to re…

Hi, there’s a beta release for the new dataset in this thread: Multi-language Dataset Beta Release (Common Voice). The multi-language dataset is now available to the Common Voice community as a beta release! This release includes all new, multi-language data that has…

I filled out the form for the beta dataset a few days ago but haven’t received any email. Is there any other way to get hold of it? Thanks

Speaker ID split between train/test/dev

Common Voice

gregor February 12, 2019, 5:15pm 3

Hi,

there’s a beta release for the new dataset in this thread: Multi-language Dataset Beta Release

The related metadata file can be split up with the linked CorporaCreator tool to make sure that no speaker overlap exists in the train/dev/test sets.

If you don’t want to do that manual work, we’ll also do a full release in less than a month.
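For anyone who wants to see what a speaker-disjoint split looks like in practice, here is a minimal sketch. It is not CorporaCreator's actual algorithm, only an illustration of the idea. It assumes each metadata row carries a speaker identifier in a `client_id` column; the function names `split_by_speaker` and `speaker_overlap`, the 10% dev/test fractions, and the toy rows are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_speaker(rows, speaker_key="client_id", dev_frac=0.1, test_frac=0.1, seed=0):
    """Assign every clip of a given speaker to exactly one of train/dev/test."""
    # Group clips by speaker so a speaker is always moved as a unit.
    by_speaker = defaultdict(list)
    for row in rows:
        by_speaker[row[speaker_key]].append(row)

    # Shuffle speakers (not clips) with a fixed seed for reproducibility.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_dev = int(len(speakers) * dev_frac)
    n_test = int(len(speakers) * test_frac)
    groups = {
        "dev": speakers[:n_dev],
        "test": speakers[n_dev:n_dev + n_test],
        "train": speakers[n_dev + n_test:],
    }
    return {name: [r for s in ids for r in by_speaker[s]] for name, ids in groups.items()}


def speaker_overlap(splits, speaker_key="client_id"):
    """Return the speaker IDs that appear in more than one split."""
    seen = {}
    overlap = set()
    for name, rows in splits.items():
        for row in rows:
            # Remember the first split each speaker was seen in.
            owner = seen.setdefault(row[speaker_key], name)
            if owner != name:
                overlap.add(row[speaker_key])
    return overlap


# Toy metadata (assumption): 20 hypothetical speakers with 3 clips each.
rows = [{"client_id": f"spk{i}", "path": f"clip_{i}_{j}.mp3"} for i in range(20) for j in range(3)]
splits = split_by_speaker(rows)
print({name: len(clips) for name, clips in splits.items()})
print(speaker_overlap(splits))  # an empty set means no speaker is shared between splits
```

The `speaker_overlap` check can also be run on an existing train/dev/test split to confirm whether speakers are shared. Note that the fractions here apply to the number of speakers, not the number of clips, so split sizes in clips can be uneven when some speakers contributed far more recordings than others. For a real metadata file, the rows could be loaded with `csv.DictReader` (with `delimiter="\t"` for a TSV file), assuming the file has a header row.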

