I had 300,000 (3 lakh) audio files, of which I used 70% for training, 15% for validation, and 15% for testing. After training on DeepSpeech 0.4.1, the test results were WER 0.32, CER 0.11, and loss 6.
After that I gradually increased my dataset to 600,000 (6 lakh) files.
I just want to know: should I build a new test set by taking 10% of the 600,000 files and check whether the model does better on it than before, or is it wrong to evaluate that way?
Or should I keep my test set fixed while growing the training set, and keep evaluating on that same test set?
How do I know the model is improving as I train on more audio? What is the right approach?
As you have a lot of samples, taking a simple percentage cut can waste training data. I recommend using a sample size calculator like this one: https://www.surveymonkey.com/mp/sample-size-calculator/ (we use a 99% confidence level with a 1% margin of error for our dev/test sample sizes).
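For reference, calculators like the one linked typically use Cochran's sample-size formula with a finite-population correction. A minimal sketch of that computation (assuming z ≈ 2.58 for 99% confidence and the most conservative proportion p = 0.5, which is what such calculators default to):

```python
import math

def sample_size(population, z=2.58, margin=0.01, p=0.5):
    """Cochran's formula with finite-population correction.
    z=2.58 ~ 99% confidence; p=0.5 maximizes the required sample."""
    n0 = z**2 * p * (1 - p) / margin**2           # infinite-population estimate
    return math.ceil(n0 / (1 + n0 / population))  # shrink for finite population

print(sample_size(570838))  # → 16170, matching the calculator's output
```

The finite-population correction is why the required sample barely grows once the dataset is large: for any population the 99%/1% sample never exceeds 16,641 files here.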
You should update to current master, as there have been a lot of improvements since v0.4.1.
If you want to make apples to apples comparisons between different models then the validation/test sets need to be identical. If you’re continuously collecting data, fixed dev/test sets will tend to be more and more biased over time as new training data gets added. To handle this, I recommend making new dev/test sets occasionally and then passing multiple files to --dev_files/--test_files so that you can keep track of things correctly. You can think of it as a bit of a versioning scheme, having e.g. dev_v1.csv, dev_v2.csv, etc, as you collect data. That way you’ll be able to know if you’re regressing on a set that you previously did well on.
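One way to script that versioning, as a sketch: this assumes DeepSpeech's `wav_filename,wav_filesize,transcript` CSV import format, and the `make_versioned_split` helper and file names are hypothetical.

```python
import csv
import random

def make_versioned_split(new_files, dev_size, test_size, version, seed=42):
    """Carve disjoint dev/test sets out of a newly collected batch of
    samples and write DeepSpeech-style CSVs: dev_vN.csv, test_vN.csv,
    train_vN.csv. Each row is (wav_filename, wav_filesize, transcript)."""
    random.seed(seed)
    shuffled = random.sample(new_files, len(new_files))  # shuffled copy
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    for name, rows in (("dev", dev), ("test", test), ("train", train)):
        with open(f"{name}_v{version}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["wav_filename", "wav_filesize", "transcript"])
            writer.writerows(rows)
    return dev, test, train
```

At training time you would then pass every version, e.g. `--dev_files dev_v1.csv,dev_v2.csv --test_files test_v1.csv,test_v2.csv`, so a regression on an older set stays visible.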
I used the calculator: I had 570,838 files, so I entered that as the population size with 99% confidence and a 1% margin of error, and got a sample size of 16,170. Do you mean that out of the 570,838 files I should divide 16,170 between the test and dev sets?
Thank you, I understood how the distribution should be done.
After doing the same, I am getting a training loss of infinity, while the validation loss decreases every epoch.
The learning rate I have set is 0.0001.
What can be the reason for such a result?
Previously, when I trained on the 300,000 files, I did not get an infinite training loss; it only appeared after increasing the data from 300,000 to 600,000.
Can corrupt data be the reason? If yes, how can I identify and remove the corrupt files that are driving the training loss to infinity?
Corrupt data could be the cause, but I’ve never seen corrupt data cause this problem. I’ve only seen it arise as a result of the learning rate being too high.
Even after lowering the learning rate to 0.00005, I am still getting infinite loss on the training data. Can you tell me how, and in which file, v0.6 identifies the audio files that cause infinite loss? And can I apply that file directly to 0.4.1 without updating to 0.6?
You cannot. I tried to introduce such a check into v0.5.1 prior to release and it would have been too much work. It is easier to update to v0.6.0: if a file produces NaN or infinite loss, it will print which file specifically was problematic.
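Until you update, a rough pre-flight check outside DeepSpeech can catch the usual culprits. This is a sketch, not DeepSpeech's own check: the `check_wav` helper is hypothetical, and it assumes 16 kHz mono WAV input with an approximate 20 ms feature step. Infinite CTC loss typically appears when a clip yields fewer time steps than its transcript has characters, or when the file is unreadable:

```python
import wave

def check_wav(path, transcript, sample_rate=16000, window_step_ms=20):
    """Return a list of problems with one training sample (empty = OK)."""
    problems = []
    try:
        with wave.open(path) as w:
            frames, rate = w.getnframes(), w.getframerate()
    except (wave.Error, EOFError, FileNotFoundError) as e:
        return [f"unreadable: {e}"]
    if frames == 0:
        problems.append("empty audio")
    if rate != sample_rate:
        problems.append(f"unexpected sample rate {rate}")
    # rough count of feature frames the acoustic model will see
    n_steps = (frames / rate) * 1000 / window_step_ms
    if n_steps < len(transcript):
        problems.append("audio too short for transcript (CTC loss -> inf)")
    return problems
```

Running this over every row of the training CSV and dropping any file that reports a problem should remove the samples most likely to blow up the loss, though only the v0.6.0 check is authoritative about what DeepSpeech itself rejects.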