Loss spikes after restoring checkpoints

I have been running into a strange issue. Initially the model trains and converges properly, but when I pause training and later resume it by restoring the model from a checkpoint, the loss spikes well above its value before the pause and only then starts decreasing again from that higher point.

Here is a snapshot of the initial training log:

Epoch 42 |   Training | Elapsed Time: 2:40:25 | Steps: 6275 | Loss: 4.610984
Epoch 42 | Validation | Elapsed Time: 0:15:08 | Steps: 784 | Loss: 4.063824 | Dataset: data/dev_distributed.csv
I Saved new best validating model with loss 4.063824 to: /work/ssd/DeepSpeech/checkpoints/best_dev-1002347
--------------------------------------------------------------------------------
Epoch 43 |   Training | Elapsed Time: 2:43:34 | Steps: 6275 | Loss: 4.554272
Epoch 43 | Validation | Elapsed Time: 0:15:04 | Steps: 784 | Loss: 4.529190 | Dataset: data/dev_distributed.csv
--------------------------------------------------------------------------------
Epoch 44 |   Training | Elapsed Time: 2:40:47 | Steps: 6275 | Loss: 4.502114
Epoch 44 | Validation | Elapsed Time: 0:15:07 | Steps: 784 | Loss: 4.139136 | Dataset: data/dev_distributed.csv
--------------------------------------------------------------------------------
Epoch 45 |   Training | Elapsed Time: 2:40:22 | Steps: 6275 | Loss: 4.461294
Epoch 45 | Validation | Elapsed Time: 0:15:02 | Steps: 784 | Loss: 4.373661 | Dataset: data/dev_distributed.csv
--------------------------------------------------------------------------------
Epoch 46 |   Training | Elapsed Time: 2:42:05 | Steps: 6275 | Loss: 4.397513
Epoch 46 | Validation | Elapsed Time: 0:16:35 | Steps: 784 | Loss: 3.998155 | Dataset: data/dev_distributed.csv
I Saved new best validating model with loss 3.998155 to: /work/ssd/DeepSpeech/checkpoints/best_dev-1027447
--------------------------------------------------------------------------------

After the 46th epoch, I killed the training and later resumed it from the best_dev-1027447 checkpoint. Here is a snapshot of the training that followed after restoring the checkpoint:

Epoch 47 |   Training | Elapsed Time: 2:37:44 | Steps: 6275 | Loss: 5.759453
Epoch 47 | Validation | Elapsed Time: 0:14:59 | Steps: 784 | Loss: 4.164374 | Dataset: data/dev_distributed.csv
I Saved new best validating model with loss 4.164374 to: /work/ssd/DeepSpeech/checkpoints/best_dev-1033722
--------------------------------------------------------------------------------
Epoch 48 |   Training | Elapsed Time: 2:38:58 | Steps: 6275 | Loss: 5.537706
Epoch 48 | Validation | Elapsed Time: 0:15:36 | Steps: 784 | Loss: 4.149571 | Dataset: data/dev_distributed.csv
I Saved new best validating model with loss 4.149571 to: /work/ssd/DeepSpeech/checkpoints/best_dev-1039997
--------------------------------------------------------------------------------
Epoch 49 |   Training | Elapsed Time: 2:38:21 | Steps: 6275 | Loss: 5.581811
Epoch 49 | Validation | Elapsed Time: 0:15:08 | Steps: 784 | Loss: 4.712744 | Dataset: data/dev_distributed.csv
--------------------------------------------------------------------------------

As you can see, the training loss jumps after the checkpoint is restored (from about 4.40 at epoch 46 to about 5.76 at epoch 47). Can anyone help me debug this or explain why it happens?
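
In case it is useful for debugging, here is a small sketch (assuming TensorFlow 1.15, which DeepSpeech 0.8.2 is built against) that lists every variable stored in the checkpoint, e.g. to check whether the Adam optimizer slots were saved alongside the model weights:

import tensorflow as tf

# Checkpoint prefix from the log above; adjust the path as needed.
ckpt = "/work/ssd/DeepSpeech/checkpoints/best_dev-1027447"

# Print every variable name and shape stored in the checkpoint.
# Adam slot variables typically appear with an "/Adam" or "/Adam_1" suffix;
# if they are absent, the optimizer state gets re-initialized on resume.
for name, shape in tf.train.list_variables(ckpt):
    print(name, shape)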

Additional Details:

DeepSpeech version: 0.8.2
Command used:

python3 DeepSpeech.py --train_files train.csv --dev_files dev.csv --augment <some_augmentations> --train_cudnn --use_allow_growth --cache_for_epochs 0
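
If it helps, I can also compare individual weights between the checkpoint I resumed from (best_dev-1027447) and the first one written after resuming (best_dev-1033722) to confirm the restore actually loaded the saved values. A rough sketch along the same lines as above; the variable name is only a placeholder, the real names can be taken from the listing:

import numpy as np
import tensorflow as tf

before = tf.train.load_checkpoint("/work/ssd/DeepSpeech/checkpoints/best_dev-1027447")
after = tf.train.load_checkpoint("/work/ssd/DeepSpeech/checkpoints/best_dev-1033722")

# Placeholder variable name -- substitute one printed by the listing above.
name = "layer_1/weights"
diff = np.max(np.abs(after.get_tensor(name) - before.get_tensor(name)))

# Some drift is expected after an epoch of training; a very large
# difference would suggest the weights were not actually restored.
print(name, diff)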