The loss keeps rising for v0.7.0

So I tried to train the v0.7.0 model, and this is my command:

python ./DeepSpeech.py \
--n_hidden 2048 \
--train_files ../../librispeech/librivox-train-clean-100.csv,../../librispeech/librivox-train-clean-360.csv,../../librispeech/librivox-train-other-500.csv \
--dev_files ../../librispeech/librivox-dev-clean.csv \
--test_files ../../librispeech/librivox-test-clean.csv \
--train_batch_size 32 \
--dev_batch_size 16 \
--test_batch_size 16 \
--epochs 50 \
--export_dir ./export/v0/v5 \
--checkpoint_dir ./checkpoint/v0/v5

Training loss
Epoch 0 | Training | Elapsed Time: 5:24:57 | Steps: 8788 | Loss: 173.039561
Epoch 1 | Training | Elapsed Time: 5:23:45 | Steps: 8788 | Loss: 114.268374
Epoch 2 | Training | Elapsed Time: 5:23:26 | Steps: 8788 | Loss: 102.552054
Epoch 3 | Training | Elapsed Time: 5:23:36 | Steps: 8788 | Loss: 97.543262
Epoch 4 | Training | Elapsed Time: 5:23:56 | Steps: 8788 | Loss: 95.154268
Epoch 5 | Training | Elapsed Time: 5:23:36 | Steps: 8788 | Loss: 94.396409
Epoch 6 | Training | Elapsed Time: 5:23:14 | Steps: 8788 | Loss: 94.702216
Epoch 7 | Training | Elapsed Time: 5:24:00 | Steps: 8788 | Loss: 96.500687
Epoch 8 | Training | Elapsed Time: 5:36:35 | Steps: 8788 | Loss: 98.647227
Epoch 9 | Training | Elapsed Time: 5:32:55 | Steps: 8788 | Loss: 102.399158

Validation loss (dev)
Epoch 0 | Validation | Elapsed Time: 0:00:51 | Steps: 168 | Loss: 59.215597
Epoch 1 | Validation | Elapsed Time: 0:00:50 | Steps: 168 | Loss: 49.190173
Epoch 2 | Validation | Elapsed Time: 0:00:49 | Steps: 168 | Loss: 44.888373
Epoch 3 | Validation | Elapsed Time: 0:00:50 | Steps: 168 | Loss: 43.367872
Epoch 4 | Validation | Elapsed Time: 0:00:50 | Steps: 168 | Loss: 42.031243
Epoch 5 | Validation | Elapsed Time: 0:00:50 | Steps: 168 | Loss: 42.560297
Epoch 6 | Validation | Elapsed Time: 0:00:49 | Steps: 168 | Loss: 42.656643
Epoch 7 | Validation | Elapsed Time: 0:00:50 | Steps: 168 | Loss: 43.767814
Epoch 8 | Validation | Elapsed Time: 0:00:54 | Steps: 168 | Loss: 44.755181
Epoch 9 | Validation | Elapsed Time: 0:00:50 | Steps: 168 | Loss: 47.417216

The best model was saved after epoch 4. Training is still running, but it just doesn’t feel right to hit this case in such an early epoch.

The default dropout is really low at 0.05; set it to 0.25 or higher. That should change things. And train on a GPU if possible.

Thanks for your reply! I’ll try a higher dropout rate. I am using an 8 GB 1080, though.

Hi @othiele,
I have very limited training time, maybe 50 epochs. I have heard that a higher dropout rate can mean higher loss in the early epochs but a better loss at convergence, which needs a lot of training resources and time. So which dropout rate would you recommend? (I am guessing 0.25.) The learning rate may matter too; maybe 0.001 is enough?
Thanks!

A learning rate of 0.0001 is normal for fresh data sets.

Split all data 75/15/10 for train/dev/test; yours looks different.

You have to experiment a bit, but 15 epochs should give ok results.
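For concreteness, the command from the top of the thread with those suggestions applied would look something like the following. This is only a sketch: --dropout_rate, --learning_rate and --epochs are existing DeepSpeech.py flags, but the values are just the ones discussed here, not a verified recipe.

python ./DeepSpeech.py \
--n_hidden 2048 \
--train_files ../../librispeech/librivox-train-clean-100.csv,../../librispeech/librivox-train-clean-360.csv,../../librispeech/librivox-train-other-500.csv \
--dev_files ../../librispeech/librivox-dev-clean.csv \
--test_files ../../librispeech/librivox-test-clean.csv \
--train_batch_size 32 \
--dev_batch_size 16 \
--test_batch_size 16 \
--dropout_rate 0.25 \
--learning_rate 0.0001 \
--epochs 15 \
--export_dir ./export/v0/v5 \
--checkpoint_dir ./checkpoint/v0/v5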

Thanks for your reply!

Sorry, what does this mean?

I don’t really know.
v0.7.0 was trained for 325 epochs; the loss is 14 but the WER is 12.2 (w/o LM and beam=1).
v0.6.1 was trained for 75 epochs; the loss is 17 but the WER is 15.1 (w/o LM and beam=1).
Note that in these cases the LM optimizer does not affect evaluation.
It seems that even though the loss only drops by 3, the WER improves a lot, so those later training epochs may be necessary.

I still need to train to the convergence point (maybe WER < 15?), so I am trying to make it converge more quickly.

I’ll run experiments; I just figured asking for some advice might be much more efficient.

About 15% of your data should be used for the dev/validation set. Looking at your numbers, it is more like 1% for you. This will lead to bad results.

Ok, will do. Thanks!

Don’t go with fixed percentages; this will quickly lead to wasting training data. Do a statistical power analysis and go with a sample size that gives you good enough confidence in the results.
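Not something prescribed in this thread, but as one illustration of what such an analysis can look like: treating per-utterance correctness as a Bernoulli variable (so this bounds sentence error rate rather than WER, and is only a rough guide), the standard sample-size formula for a proportion gives the number of dev/test utterances needed for a chosen margin of error.

import math

def required_utterances(margin, p=0.5, z=1.96):
    # Sample size to estimate a proportion within +/- margin at ~95%
    # confidence (z = 1.96); p = 0.5 is the worst-case variance.
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(required_utterances(0.02))  # about 2401 utterances for +/- 2%
print(required_utterances(0.05))  # about 385 utterances for +/- 5%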

Just curious, was the v0.7.0 pretrained model trained with this?

  • train_files Fisher, LibriSpeech, Switchboard, Common Voice English, and approximately 1700 hours of transcribed WAMU (NPR) radio shows explicitly licensed to use as training corpora.
  • dev_files LibriSpeech clean dev corpus.
  • test_files LibriSpeech clean test corpus
  • train_batch_size 128
  • dev_batch_size 128
  • test_batch_size 128
  • n_hidden 2048
  • learning_rate 0.0001
  • dropout_rate 0.40
  • epochs 125

Its dev set and test set are very small; it just has more training data compared to my case.

Also, does anyone know the reason for the huge gap between v0.7.0 and v0.6.1 that I mentioned above? (Or is it just because of more epochs?) I think this would influence my choice of how many epochs to train.

The training details are included in the release notes.

LibriSpeech’s dev and test sets are indeed small, but they are almost spotless (no transcription errors). We have found that being error-free has a much bigger impact on the usefulness of a validation and test set than its size does.

I think you mean

Training Regimen + Hyperparameters for fine-tuning

I did not see the 75/15/10 thing in there, though. But yeah, I did read about it in the “train french model” post. So I guess this won’t matter too much since I am not training on a custom dataset?

I am kind of confused. If “bigger impact” means a positive one, then my dev/test sets have no problem, right? (I’m just using the standard LibriSpeech 1k-hour data, after all.)
By the way, the loss reached 20 and WER reached 18 at 20 epochs. Maybe the trend is not that bad?

This is probably my last question here, since the thread is getting out of focus. Sorry about that.

If you are training on LibriSpeech, go with @reuben’s advice. If you have a self-made dataset, I would use 10-15% as dev to get good results. If you have thousands of hours, this can be smaller.
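If it helps, here is a minimal sketch of a 75/15/10 split for a self-made dataset, assuming it lives in a single DeepSpeech-style CSV called all.csv; the filename and the use of pandas are assumptions for the example, not anything from this thread.

import pandas as pd

# Shuffle once so all three sets come from the same distribution.
df = pd.read_csv("all.csv").sample(frac=1, random_state=42)

n = len(df)
train_end = int(0.75 * n)
dev_end = int(0.90 * n)

df[:train_end].to_csv("train.csv", index=False)       # 75% train
df[train_end:dev_end].to_csv("dev.csv", index=False)  # 15% dev
df[dev_end:].to_csv("test.csv", index=False)          # 10% test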

Much appreciated, @othiele and @reuben!

Hello @othiele, I would like to ask for advice on training our own dataset.

Should the dev/test data come from the same pool as the training data? Or must the dev/test sentences be different from the ones in train?

And what’s the purpose of the n_hidden variable? I changed it to 2048 / 1024 / 512 for an initial training run with the same lm.binary and kenlm.scorer. A “Segmentation fault” sometimes occurs with n_hidden 2048, but not with n_hidden 512. With different dataset sizes and different n_hidden values, the “Segmentation fault” may not occur at all.

Although the “Segmentation fault” seems related to TensorFlow, I am confused: with different n_hidden values it sometimes occurs and sometimes doesn’t.

The dev set is used to tune parameters for the next epoch. Ideally this is good data that is very close to what you want to recognize later. @reuben said that they use a very small amount of very good data for that. I don’t have that, so I just take 7-10% of my training data.

For the test set, we built a special smaller set that represents all the different nuances we want to evaluate. It is handcrafted, and we use the same one for all training runs to evaluate not just the training itself but real-world fit.

n_hidden makes the neural net more complex. We found that there is not much difference between 2048 and 512 for our smaller datasets, but we don’t have 5000 hours :slight_smile:

Definitely increase the batch size as high as you can; this will speed up training.

I remember having this error only with bad data; it shouldn’t happen with good data. @lissyx, what do you think?

Sometimes I’m happy, sometimes I’m not. There’s not much you can do? Well, there’s not much I can do with your report.

Thank you @othiele and @lissyx for your reply.

Regarding the “Segmentation fault”, I understand that it should be related to bad data. But does “bad data” mean bad quality (too much noise), or not enough quantity?

Obviously you did not get my point: there’s nothing we can do with just “segmentation fault”. Please share logs, a gdb stack trace, etc. It could be a corrupted WAV file, or one that makes TensorFlow unhappy. Again, without anything actionable, we are all just wasting our time.
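For reference, one way to capture that kind of information, assuming gdb is available; the bracketed part stands for your own training flags from earlier in the thread, not new options.

gdb --args python ./DeepSpeech.py [your usual training flags]
(gdb) run
    (wait for the segmentation fault to happen)
(gdb) bt full

Copying the full backtrace that bt full prints, together with the training log, gives something actionable to look at.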

Understood. I will review all the WAV files first to make sure there are no corrupt ones.

If I catch the “segmentation fault” again, I will share the logs, transcripts, WAV files, and command script for investigation in a NEW thread.
