hi, @othiele
I have very limited training time, maybe 50 epochs. I heard that dropout may lead to higher loss in the early epochs but the best loss at convergence, which needs a lot of training resources and time. So which dropout rate would you recommend? (I am guessing 0.25.) The learning rate may matter too; maybe 0.001 is enough?
Thanks!
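For reference, both of those knobs are ordinary training flags on DeepSpeech.py, so they are easy to sweep. A minimal sketch of how they can be passed (flag names as I recall them from the 0.7-era training docs, so double-check against your version; the CSV paths are placeholders):

```python
# Minimal sketch: launching DeepSpeech training with dropout / learning rate set.
# Flag names are assumed from the 0.7-era training docs; verify against your checkout.
import subprocess

subprocess.run([
    "python", "DeepSpeech.py",
    "--train_files", "train.csv",   # placeholder paths
    "--dev_files", "dev.csv",
    "--test_files", "test.csv",
    "--epochs", "50",
    "--dropout_rate", "0.25",       # the value being asked about
    "--learning_rate", "0.001",
    "--checkpoint_dir", "checkpoints/",
], check=True)
```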
idk man
v0.7.0 was trained for 325 epochs; the loss is 14 but the WER is 12.2 (without LM, beam=1).
v0.6.1 was trained for 75 epochs; the loss is 17 but the WER is 15.1 (without LM, beam=1).
Note that in these cases the LM does not come into play during evaluation.
It seems that even though the loss only drops by 3, the WER improves a lot, so the later training epochs may be necessary.
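Side note for anyone comparing these numbers: WER is just word-level edit distance divided by the number of reference words, which is why it can keep improving even when the loss barely moves. A minimal sketch of that standard computation:

```python
# Minimal sketch of word error rate: word-level Levenshtein distance
# divided by the number of words in the reference transcript.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```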
I still need to train to the convergence point (maybe WER < 15?), so I am trying to make it converge more quickly.
I'll run experiments; I just thought asking for advice first might be more efficient.
Don't go with fixed percentages; that will quickly lead to wasting training data. Do a statistical power analysis and go with a sample size that gives you good enough confidence in the results.
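Not a full power analysis, but as a rough way to turn "good enough confidence" into a number: you can size an evaluation set like a binomial proportion estimate. A minimal sketch, assuming you want the error rate pinned down to within a given margin at 95% confidence and treating errors as independent (a simplification for real speech):

```python
# Minimal sketch: sample size for estimating an error rate within a margin
# of error, using the normal approximation to a binomial proportion.
# Treating errors as independent is a simplification for real speech data.
import math

def required_samples(expected_rate: float, margin: float, z: float = 1.96) -> int:
    """Samples needed so the 95% CI half-width is <= margin."""
    return math.ceil(z ** 2 * expected_rate * (1 - expected_rate) / margin ** 2)

# e.g. expecting ~15% WER and wanting it known to within +/- 1%:
print(required_samples(0.15, 0.01))  # ~4899 units (words or utterances, depending on how you count)
```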
Its dev and test sets are very small; it just has more training data compared to my case.
Also, does anyone know the reason for the huge gap between v0.7.0 and v0.6.1 that I mentioned above? (Or is it just because of the larger number of epochs?) I think this would influence my choice of how many epochs to train.
The training details are included in the release notes.
LibriSpeech's dev and test sets are indeed small, but they are almost spotless (no transcription errors). We have found that being error-free has a much bigger impact on the usefulness of a validation or test set than its size.
Training Regimen + Hyperparameters for fine-tuning
I did not see the 75/15/10 split mentioned in that, though. But I did read about it in the "train french model" post. So I guess this won't matter much since I am not training on a custom dataset?
I am a bit confused. If "bigger impact" means a positive one, then my dev/test sets have no problem, right? (I'm just using the LibriSpeech 1,000-hour data.)
Btw, the loss reached 20 and the WER reached 18 at 20 epochs. Maybe the trend is not that bad?
This is probably my last question here since the thread is drifting off focus. Sorry about that.
If you are training on LibriSpeech, go with @reuben's advice. If you have a self-made dataset, I would use 10-15% as dev to get good results. If you have thousands of hours, it can be smaller.
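To make the 10-15% concrete, here is a minimal sketch of carving a dev set out of a self-made dataset CSV with a fixed seed so the split is reproducible. The file names are placeholders and the columns are assumed to follow the usual DeepSpeech import format:

```python
# Minimal sketch: hold out ~10% of a self-made dataset as the dev set.
# CSV layout (wav_filename, wav_filesize, transcript) assumed to follow
# the usual DeepSpeech import format; file names are placeholders.
import csv, random

with open("all_clips.csv", newline="") as f:
    rows = list(csv.DictReader(f))

random.seed(42)          # reproducible split
random.shuffle(rows)
n_dev = int(0.10 * len(rows))
dev, train = rows[:n_dev], rows[n_dev:]

for name, subset in (("train.csv", train), ("dev.csv", dev)):
    with open(name, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(subset)
```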
Hello @othiele, I would like to ask for advice on training our own dataset.
Should the dev/test data come from the training data? Or must the dev/test sentences be different from the sentences in the training set?
Also, what is the purpose of the n_hidden variable? I changed it to 2048 / 1024 / 512 and ran an initial training with the same lm.binary and kenlm.scorer. A "Segmentation fault" sometimes occurs with n_hidden 2048, but not with n_hidden 512. With a different dataset size and a different n_hidden, the "Segmentation fault" may not occur again.
Although the "Segmentation fault" is related to TensorFlow, I am confused: with different n_hidden values it sometimes occurs and sometimes doesn't.
The dev set is used to adjust parameters for the next epoch. Ideally this is good data that is very close to what you want to recognize later. @reuben said that they use very little, but very high-quality, data for that. I don't have that, so I just take 7-10% of my training data.
For the test set, we built a special smaller set that represents all the different nuances that we want to evaluate. This is handcrafted and we use the same one for all trainings to evaluate not just training but real world fit.
n_hidden controls the width of the neural net; a larger value makes it more complex. We found that there is not much difference between 2048 and 512 for our smaller datasets, but we don't have 5,000 hours.
Definitely increase the batch size as high as you can; this will speed up training.
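Both of those are also just flags on the training script; a minimal sketch of the relevant additions (flag names again assumed from the 0.7-era docs, verify locally):

```python
# Minimal sketch: width and batch-size knobs as DeepSpeech.py flags
# (flag names assumed from the 0.7-era training docs; verify locally).
width_and_batch_flags = [
    "--n_hidden", "2048",        # layer width; 512 was nearly as good on our small datasets
    "--train_batch_size", "64",  # raise until you hit GPU memory limits
    "--dev_batch_size", "64",
    "--test_batch_size", "64",
]
```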
I remember having this error only with bad data; it shouldn't happen with clean data. @lissyx, what do you think?
lissyx
"Sometimes I'm happy, sometimes I'm not." There's not much you can do with that, right? Well, there's not much I can do with your report either.
For the "Segmentation fault", I understand it should be related to bad data. But does "bad data" mean poor quality (too much noise) or not enough quantity?
lissyx
Obviously you did not get my point: there's nothing we can do with just "segmentation fault". Please share logs, a gdb stack trace, etc. It could be a corrupted WAV file, or one that makes TensorFlow unhappy. Again, without anything actionable, we are all losing our time guessing at nothing.
Understood. I will first review all the WAV files to make sure none of them are corrupt.
If I hit the "segmentation fault" again, I will share the logs, transcripts, WAV files, and command script for investigation in a new thread.
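A minimal sketch of that kind of sweep, using only the standard-library wave module; it flags files with broken or truncated headers, though not every kind of corruption that could still upset TensorFlow (the clips directory is a placeholder):

```python
# Minimal sketch: flag WAV files that the standard-library wave module
# cannot parse. Catches broken/truncated headers, not every possible
# corruption that could still upset the training pipeline.
import sys, wave
from pathlib import Path

bad = []
for path in Path("clips").rglob("*.wav"):   # placeholder directory
    try:
        with wave.open(str(path), "rb") as w:
            if w.getnframes() == 0:
                bad.append((path, "zero frames"))
    except (wave.Error, EOFError) as exc:
        bad.append((path, str(exc)))

for path, reason in bad:
    print(f"{path}: {reason}", file=sys.stderr)
```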