Common Voice datasets (Mandarin zh-tw)

I’ve tried using the Common Voice dataset with DeepSpeech, and I’m wondering why the train/dev/test split is almost 1:1:1?

Also, the training set doesn’t cover the full character set: some characters in the dev/test sets never appear in the training set. I suspect this may be why the validation loss doesn’t decrease as expected. (Maybe I’m wrong, it’s just a guess.)
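One way to check this is to compare the character sets of the splits directly. A minimal sketch, assuming you have already pulled the `sentence` column out of the Common Voice `train.tsv` and `dev.tsv` files (the sample sentences below are made up for illustration):

```python
def chars_missing_from_train(train_sentences, eval_sentences):
    """Return characters that appear in eval sentences but never in train."""
    train_chars = set("".join(train_sentences))
    eval_chars = set("".join(eval_sentences))
    return eval_chars - train_chars

# Hypothetical stand-ins for the "sentence" column of train.tsv / dev.tsv.
train_sentences = ["你好", "早安"]
dev_sentences = ["你好嗎"]

missing = chars_missing_from_train(train_sentences, dev_sentences)
print(sorted(missing))  # → ['嗎'] — a character the model never saw in training
```

Any character reported here can never be predicted correctly, since DeepSpeech’s output alphabet is built from the training data, so a non-empty result would support the guess about the stuck validation loss.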

Is your question covered in the following threads?


Yes, that’s exactly what I’m asking!