Why does train.tsv include only a few files (just 3% of the validated set)?

susan · March 8, 2019, 8:35pm

It is my current understanding that the train/dev/test sets are completely regenerated with each release, with no guarantee that the previous splits will be preserved. So I would caution against relying on the released splits as a fixed reference in academic work. See this thread: Dataset split best practices?
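If you need splits that stay the same from one release to the next, one option is to derive your own from validated.tsv instead of using the released ones. Below is a minimal sketch, not what the Common Voice tooling does: it hashes each speaker ID into a bucket, so a given speaker always lands in the same split and never appears in more than one. The `client_id` column name, the 80/10/10 proportions, and the helper names `assign_split` and `split_rows` are my assumptions for illustration.

```python
import csv
import hashlib


def assign_split(client_id, dev_pct=10, test_pct=10):
    """Deterministically map a speaker ID to 'train', 'dev', or 'test'.

    The percentages are illustrative defaults, not official values.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as a stable bucket in 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


def split_rows(rows):
    """Group row dicts (each with a 'client_id' key) by assigned split."""
    splits = {"train": [], "dev": [], "test": []}
    for row in rows:
        splits[assign_split(row["client_id"])].append(row)
    return splits


def load_rows(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Usage would be something like `splits = split_rows(load_rows("validated.tsv"))`. Because the assignment depends only on the speaker ID, clips that were in your test set stay there when a new release adds more data. The split sizes are only approximate, since speakers contribute different numbers of clips.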
