@kdavis: If I have 1000h of data, what should my ideal train, dev, and test sizes be?
@agarwalaashish20 I’m not @kdavis, but the standard in ML is an 80/10/10% split for train/eval/test on most datasets.
The standard, unfortunately, is wrong almost all of the time.
@agarwalaashish20 The answer depends on how many clips you have.
Basically, if you have N clips and you want T clips in the training set, V clips in the validation set, and E clips in the test set, then you first need to have the obvious constraint T + V + E <= N, where you try to have T, V, and E as large as possible. In addition, you want to have V and E be “statistically significant sample sizes” relative to T.
Concretely, you’d define a “statistically significant sample size” using something like a sample size calculator with a confidence level of 99% and a margin of error of 1%, treating T as the population size. This would mean, for example, that a T of 1000000 would require that V and E both be 16369.
The reason for doing this is to ensure that, for example, the WER calculated using T closely tracks, as defined by a confidence level of 99% and a margin of error of 1%, the WER calculated using V or E, since both V and E are “statistically significant sample sizes” relative to T.
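The sizing rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: I'm assuming the calculator uses Cochran's formula with a finite population correction, a response proportion of 0.5, and the rounded z-score 2.58 for 99% confidence, since that combination reproduces the 16369 figure. The names `sample_size` and `split_sizes` are made up for this example and are not part of any library.

```python
import math

# Assumption: z-scores rounded the way common online calculators round them.
Z_SCORES = {0.90: 1.645, 0.95: 1.96, 0.99: 2.58}


def sample_size(population, confidence=0.99, margin=0.01, p=0.5):
    """Cochran's sample size with finite population correction, rounded up."""
    z = Z_SCORES[confidence]
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Shrink it for a finite population of the given size.
    return math.ceil(n0 / (1 + n0 / population))


def split_sizes(n_clips, confidence=0.99, margin=0.01):
    """Hypothetical helper: largest T with T + V + E <= n_clips,
    where V = E = sample_size(T)."""
    lo, hi = 0, n_clips
    # Binary search works because T + 2 * sample_size(T) never decreases as T grows.
    while lo < hi:
        t = (lo + hi + 1) // 2
        if t + 2 * sample_size(t, confidence, margin) <= n_clips:
            lo = t
        else:
            hi = t - 1
    if lo == 0:
        raise ValueError("not enough clips for a train/validation/test split")
    v = sample_size(lo, confidence, margin)
    return lo, v, v


if __name__ == "__main__":
    print(sample_size(1_000_000))
    print(split_sizes(1_032_738))
```

With these assumptions, `sample_size(1_000_000)` gives 16369, matching the number quoted above. Using the more precise z-score of 2.576 would give a slightly smaller figure, so treat the exact value as calculator-dependent.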