Data distribution among sets

heyhillary · September 16, 2021, 11:34am

"Each test/train/dev set is generated non-deterministically, meaning that they will vary from release to release even for minor updates. This is to avoid reproducing and perpetuating any demographic skews in each subsequent set. "

For more details on the metadata, please check out the GitHub repository common-voice/cv-dataset (metadata and versioning details for the Common Voice dataset).
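If you need a split that stays the same from one run to the next, one option is to build your own from the validated clips. Below is a minimal sketch of that idea, not the script used to produce the official sets. It assumes you have already loaded (client_id, path) pairs from a release's TSV files; the function name `split_by_speaker` and the fraction parameters are hypothetical. Passing a fixed `seed` makes the result repeatable, while `seed=None` gives a different split on each run.

```python
import random
from collections import defaultdict


def split_by_speaker(clips, seed=None, dev_frac=0.1, test_frac=0.1):
    """Split (client_id, path) pairs into train/dev/test with no shared speakers.

    Hypothetical helper for illustration only; this is not the official
    Common Voice split procedure.
    """
    # Group clips by speaker so that each speaker lands in exactly one set.
    by_speaker = defaultdict(list)
    for client_id, path in clips:
        by_speaker[client_id].append((client_id, path))

    # Sort first so that a fixed seed always gives the same order.
    speakers = sorted(by_speaker)
    rng = random.Random(seed)  # seed=None: non-deterministic shuffle
    rng.shuffle(speakers)

    n_dev = int(len(speakers) * dev_frac)
    n_test = int(len(speakers) * test_frac)
    groups = {
        "dev": speakers[:n_dev],
        "test": speakers[n_dev:n_dev + n_test],
        "train": speakers[n_dev + n_test:],
    }
    return {
        name: [clip for spk in ids for clip in by_speaker[spk]]
        for name, ids in groups.items()
    }
```

Splitting by speaker rather than by clip is a design choice here: it keeps the same voice from showing up in both train and test. A seeded split like this is only repeatable for the same input, so it will still change when a new release adds or removes clips.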

I would love to learn more about your project. If you would like to, please feel free to share on this thread: Talk to us! How are you using Common Voice?
