The loss keeps rising for v0.7.0

othiele · April 29, 2020, 6:08pm

The default dropout of 0.05 is really low; set it to 0.25 or higher. That should change things. And train on a GPU if possible.

1258 · April 29, 2020, 9:21pm

Thanks for your reply! I’ll try a higher dropout rate. I am using an 8 GB 1080, though.

1258 · May 8, 2020, 11:21am

Hi @othiele,
I have very limited training time, maybe 50 epochs. I have heard that a higher dropout rate may lead to higher loss in the early epochs but a better loss at convergence, which takes a lot of training resources and time. So which dropout rate would you recommend? (I am guessing 0.25.) The learning rate may matter too. Maybe 0.001 is enough?
Thanks!

othiele · May 8, 2020, 11:51am

Learning rate of 0.0001 is normal for fresh data sets.

Split all data 75/15/10 for train/dev/test, yours looks different.

You have to experiment a bit, but 15 epochs should give ok results.
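
A minimal sketch of the 75/15/10 split mentioned above, assuming the dataset is simply a list of rows (for example, the lines of a CSV manifest). The function name `split_dataset`, the fixed seed, and the file names are illustrative and not part of DeepSpeech.

```python
import random


def split_dataset(rows, train_frac=0.75, dev_frac=0.15, seed=42):
    """Shuffle rows and split them into train/dev/test lists.

    The test set receives whatever remains after train and dev,
    so with the defaults it ends up with roughly 10% of the rows.
    """
    shuffled = list(rows)
    # A seeded Random instance keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    # Hypothetical clip names standing in for real manifest rows.
    rows = [f"clip_{i}.wav" for i in range(1000)]
    train, dev, test = split_dataset(rows)
    print(len(train), len(dev), len(test))
```

Shuffling before slicing matters when the manifest is sorted by speaker or recording session; otherwise the dev and test sets can end up covering only a narrow slice of the data.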

1258 · May 8, 2020, 12:08pm

Thanks for your reply!

Sorry, what does this mean?

v0.7.0 was trained for 325 epochs; the loss is 14 but WER is 12.2 (w/o LM and beam=1)
v0.6.1 was trained for 75 epochs; the loss is 17 but WER is 15.1 (w/o LM and beam=1)
Note that in these cases the LM optimizer has no effect on evaluation.
It seems that even though the loss only dropped by 3, WER improved a lot, so the later training epochs may be necessary.

I still need to train to the convergence point (maybe WER < 15?), so I am trying to make it converge more quickly.

I’ll run experiments. I just thought that asking for some advice might be much more efficient.
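
For reference, WER (word error rate) is usually defined as the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. Because it is computed on decoded text rather than on the training loss directly, the two do not have to move in step. The sketch below illustrates that standard definition; it is not DeepSpeech’s own evaluation code, and it returns a fraction (0.122 rather than 12.2).

```python
def word_error_rate(reference, hypothesis):
    """Return the word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the").
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```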

othiele · May 8, 2020, 1:36pm

About 15% of your data should be used for the dev/validation set. Looking at your numbers, this is more like 1% for you. This will lead to bad results.

1258 · May 8, 2020, 3:34pm

Ok, will do. Thanks!

reuben · May 8, 2020, 6:34pm

Don’t go with fixed percentages, this will quickly lead to wasting training data. Do a statistical power analysis and go with a sample size that gives you good enough confidence in the results.
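
As a rough illustration of sizing an evaluation set by the confidence you need rather than by a fixed percentage, the sketch below uses the normal-approximation sample size for estimating a proportion. This is a simplification and not a full power analysis: it treats every evaluated item as an independent pass/fail trial, which words inside one utterance are not, and the expected error rate and margin are assumptions you have to supply yourself. The function name is illustrative.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(expected_rate, margin, confidence=0.95):
    """Items needed to estimate an error rate to within +/- margin.

    Uses n = z^2 * p * (1 - p) / margin^2, the normal approximation
    for a binomial proportion.
    """
    if not 0 < expected_rate < 1:
        raise ValueError("expected_rate must be between 0 and 1")
    if margin <= 0:
        raise ValueError("margin must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z ** 2 * expected_rate * (1 - expected_rate) / margin ** 2)


if __name__ == "__main__":
    # Assumed values: an error rate near 15%, measured to within one point.
    print(sample_size_for_proportion(expected_rate=0.15, margin=0.01))
```

The required size depends on the margin and the expected rate, not on how much training data exists, which is why a fixed percentage can waste data on large corpora.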

1258 · May 9, 2020, 5:41am

Just curious, was the v0.7.0 pretrained model trained with this?

train_files Fisher, LibriSpeech, Switchboard, Common Voice English, and approximately 1700 hours of transcribed WAMU (NPR) radio shows explicitly licensed to use as training corpora.
dev_files LibriSpeech clean dev corpus.
test_files LibriSpeech clean test corpus
train_batch_size 128
dev_batch_size 128
test_batch_size 128
n_hidden 2048
learning_rate 0.0001
dropout_rate 0.40
epochs 125

Its dev set and test set are very small. It just has more training data than in my case.
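
For anyone wanting to try the same settings, here is a small sketch that turns the hyperparameters listed above into a command line. It assumes each name is passed to the training script as a `--name value` flag and that the script is called `DeepSpeech.py`; the CSV paths are hypothetical placeholders, so check the flags against the documentation for your version before running anything.

```python
import shlex

# Values copied from the list above; the CSV paths are hypothetical.
hyperparameters = {
    "train_files": "data/train.csv",
    "dev_files": "data/dev.csv",
    "test_files": "data/test.csv",
    "train_batch_size": 128,
    "dev_batch_size": 128,
    "test_batch_size": 128,
    "n_hidden": 2048,
    "learning_rate": 0.0001,
    "dropout_rate": 0.40,
    "epochs": 125,
}


def build_training_command(hparams, script="DeepSpeech.py"):
    """Return the argument list for a training run (assumed flag style)."""
    args = ["python", script]
    for name, value in hparams.items():
        args.extend([f"--{name}", str(value)])
    return args


if __name__ == "__main__":
    command = build_training_command(hyperparameters)
    # Quote each argument so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in command))
```

Keeping the settings in one dictionary makes it easier to change a single value, such as `dropout_rate` or `epochs`, between experiments and to log exactly what was run.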

Also, does anyone know the reason for the huge gap between v0.7.0 and v0.6.1 that I mentioned above? (Or is it just because of the larger number of epochs?) I think this would influence how many epochs I decide to train for.

reuben · May 9, 2020, 6:08am

The training details are included in the release notes.

LibriSpeech’s dev and test sets are indeed small, but they are almost spotless (no transcription errors). We have found that being error-free has a much bigger impact on the usefulness of a validation and testing set than its size.

1258 · May 10, 2020, 11:55am

I think you mean

Training Regimen + Hyperparameters for fine-tuning

I did not see the 75/15/10 split mentioned there, though. But yes, I did read about it in the “train french model” post. So I guess this won’t matter too much as long as I am not training with a custom dataset?

I am kind of confused. If “bigger impact” means a positive one, then my dev/test sets are fine, right? (I’m just using the LibriSpeech 1k-hour data.)
Btw, the loss reached 20 and WER reached 18 in 20 epochs. Maybe the trend is not that bad?

This is probably my last question here, since the thread is drifting off topic. Sorry about that.

othiele · May 10, 2020, 12:14pm

If you are training on LibriSpeech, go with @reuben’s advice. If you have a self-made dataset, I would use 10-15% as dev to get good results. If you have thousands of hours, this can be smaller.

1258 · May 10, 2020, 12:16pm

Much appreciated, @othiele and @reuben!

axcn · June 16, 2020, 2:50am

Hello @othiele, I would like to ask for advice on training with our own dataset.

Should the dev/test data come from the training data? Or must the dev/test sentences be different from those in the training set?

And what’s the purpose of the n_hidden variable? I changed it to 2048 / 1024 / 512 and started a fresh training run each time with the same lm.binary and kenlm.scorer. A “Segmentation fault” sometimes occurs with n_hidden 2048, but not with n_hidden 512. With a different dataset size and a different n_hidden, the “Segmentation fault” may not occur again.

Although the “Segmentation fault” is related to TensorFlow, I am confused: with different n_hidden values, it sometimes occurs and sometimes does not.

othiele · June 16, 2020, 8:34am

The dev set is used to change parameters for the next epoch. Ideally this is good data that is very close to what you want to recognize later. @reuben said that they use very little, very excellent data for that. I don’t have that so I take just 7-10% of my training data for that.

For the test set, we built a special smaller set that represents all the different nuances that we want to evaluate. This is handcrafted and we use the same one for all trainings to evaluate not just training but real world fit.

n_hidden sets the width of the hidden layers, so larger values make the neural net more complex. We found that there is not much difference between 2048 and 512 for our smaller datasets, but we don’t have 5000 hours.

Definitely increase the batch size as high as you can, this will speed up training.

I remember getting this error only with bad data; it shouldn’t happen with good data. @lissyx, what do you think?

lissyx · June 16, 2020, 8:38am

Sometimes, I’m happy, sometimes I’m not. There’s not much you can do? Well there’s not much I can do with your report.

axcn · June 16, 2020, 8:51am

Thank you @othiele and @lissyx for your reply.

As for the “Segmentation fault”, I understand that it is probably related to bad data. But does “bad data” mean poor quality (too much noise), or that there is not enough data?

lissyx · June 16, 2020, 8:54am

Obviously you did not get my point: there’s nothing we can do with just “segmentation fault”. Please share logs, a gdb stack trace, etc. It could be a corrupted WAV file, or one that makes TensorFlow unhappy. Again, without anything actionable, we are all wasting our time speculating.

axcn · June 16, 2020, 9:03am

Understood. I will first review all the WAV files to make sure that none of them is corrupt.

If I hit the “segmentation fault” again, I will prepare the logs, transcripts, WAV files, and command script for investigation in a NEW thread.
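
One possible way to do a first pass over the WAV files, sketched with Python’s standard `wave` module. It only checks that each header parses and that the amount of audio data matches what the header claims, so a file that passes could still upset TensorFlow for other reasons. The function name and directory layout are illustrative.

```python
import wave
from pathlib import Path


def find_unreadable_wavs(directory):
    """Return (path, reason) pairs for WAV files that fail basic checks."""
    problems = []
    for path in sorted(Path(directory).rglob("*.wav")):
        try:
            with wave.open(str(path), "rb") as wav:
                n_frames = wav.getnframes()
                if n_frames == 0:
                    problems.append((path, "no audio frames"))
                    continue
                data = wav.readframes(n_frames)
                expected = n_frames * wav.getnchannels() * wav.getsampwidth()
                if len(data) != expected:
                    problems.append((path, "audio data shorter than header claims"))
        except (wave.Error, EOFError) as exc:
            # wave.Error covers malformed headers; EOFError covers empty
            # or abruptly cut-off files.
            problems.append((path, f"unreadable: {exc!r}"))
    return problems


if __name__ == "__main__":
    # "data/clips" is a hypothetical location for the training audio.
    for path, reason in find_unreadable_wavs("data/clips"):
        print(f"{path}: {reason}")
```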

tanner · July 10, 2020, 1:57am

9 posts were split to a new topic: Training/fine-tuning DeepSpeech branch/version - 0.7.0 on Linux