I’ve tried using the Common Voice datasets with DeepSpeech, and I’m wondering why the sizes of the train/dev/test sets are almost 1:1:1?
Also, the training set doesn’t cover the whole alphabet (some characters in the dev/test sets never appear in the training set), which may be why the validation loss can’t decrease as expected. (Maybe I’m wrong, just guessing.)
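To check whether that is actually happening, you can compare the character sets of the transcripts across splits. This is just a minimal diagnostic sketch, not official tooling; it assumes the standard Common Voice TSV layout with a `sentence` column, and the file paths are placeholders:

```python
import csv

def transcript_chars(tsv_path):
    """Collect the set of characters used in the 'sentence' column of a Common Voice TSV."""
    chars = set()
    with open(tsv_path, encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            chars.update(row["sentence"])
    return chars

train_chars = transcript_chars("train.tsv")
for split in ("dev.tsv", "test.tsv"):
    missing = transcript_chars(split) - train_chars
    if missing:
        print(f"{split}: characters never seen in training: {sorted(missing)}")
```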
irvin (Irvin Chen) May 22, 2019, 9:43am
Is your question covered by the following threads?
I love this dataset, but I am concerned about an aspect of the dataset split practices. It looks as though the splits that come with the downloads are re-generated as new data is made available. If this is not the case then please let me know, since this is very important on two fronts:
1. It raises contamination issues for models that have already been trained and are being topped off as more data is made available.
2. It makes comparing different models a challenge since there are …
I have surveyed a few of the languages, and one thing I have not found is a good description of how the train/dev/test sets are split. I would be interested to know what the algorithm is. For instance, it looks like early on, when there isn’t much data in a language, the test and dev sets each get nearly as large a share of the data as the train set, but as the data grows, their shares drop and train gets the majority. Is this the case? If so, what are the percentages/cutoffs?
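For reference, a quick way to see how a given release actually splits a language is to count the rows in each TSV and print each split’s share of the total. A rough sketch, assuming the usual `train.tsv`/`dev.tsv`/`test.tsv` file names in the current directory:

```python
import csv

def count_clips(tsv_path):
    """Number of data rows (clips) in a Common Voice split file."""
    with open(tsv_path, encoding="utf-8", newline="") as f:
        return sum(1 for _ in csv.DictReader(f, delimiter="\t"))

counts = {name: count_clips(f"{name}.tsv") for name in ("train", "dev", "test")}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name}: {n} clips ({n / total:.1%})")
```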
Yes, that’s what I’m asking!
Thanks!