In many ways I worry more about the source of the sentences than about the same sentence being repeated by different speakers. I am going to train on the dataset for several epochs, so sentences are going to repeat in training no matter what. At least with multiple people saying a sentence, the repeats come with different voices. The overfitting argument loses out to the more-data argument for me. I can completely agree about segregating speakers, though. I made that mistake once. Wow, did those results look good, until I tested again in the wild and things were horrible. Gotta make mistakes to learn though!
Ideally I would like a different version of the data formatting where the train, test, and dev splits point to every recorded version of a repeated sentence, so that when you train you can cycle through the different speakers each time you fetch that sentence (roughly like the sketch below).
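Something along these lines is what I have in mind. This is just a rough Python sketch of the cycling idea, not any project's actual loader, and the (sentence, speaker_id, audio_path) metadata layout is made up for illustration:

```python
from collections import defaultdict

class SpeakerCyclingIndex:
    """Group every clip of the same sentence and rotate speakers per fetch."""

    def __init__(self, entries):
        # entries: iterable of (sentence, speaker_id, audio_path) tuples
        self.by_sentence = defaultdict(list)
        for sentence, speaker_id, audio_path in entries:
            self.by_sentence[sentence].append((speaker_id, audio_path))
        self.sentences = list(self.by_sentence)
        self.cursor = defaultdict(int)  # per-sentence round-robin position

    def __len__(self):
        return len(self.sentences)

    def fetch(self, idx):
        # Each fetch of the same sentence returns the next speaker's clip.
        sentence = self.sentences[idx]
        clips = self.by_sentence[sentence]
        speaker_id, audio_path = clips[self.cursor[sentence] % len(clips)]
        self.cursor[sentence] += 1
        return sentence, speaker_id, audio_path

# Example: the same sentence recorded by two speakers (paths are hypothetical)
index = SpeakerCyclingIndex([
    ("the cat sat", "spk1", "clips/a1.wav"),
    ("the cat sat", "spk2", "clips/b7.wav"),
    ("hello world", "spk1", "clips/a2.wav"),
])
print(index.fetch(0))  # first epoch: spk1's version
print(index.fetch(0))  # next epoch: spk2's version of the same sentence
```

That way a repeated sentence never hands the model the exact same audio twice in a row, while the split files still only need to list each sentence once.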