Training does not resume from checkpoint. Always restarts from epoch 0

Hi, I was trying to train my own model on Google Colab.

Everything seemed to be working fine a few days ago, but today, all of a sudden, the training no longer resumes from the checkpoints.

As you can see below, the logs do say that the checkpoint has been restored, but the model starts training again from epoch 0. I also tried using the checkpoint from the 0.4.1 release, but no luck; it always starts back at epoch 0.

Has anything in the code changed over the last 3-4 days? Prior to that, it worked without any problems.

Here is the log from my run.

python -u DeepSpeech.py --train_files /content/SubGen/scripts/train.csv --dev_files /content/SubGen/scripts/dev.csv --test_files /content/SubGen/scripts/val.csv --train_batch_size 12 --dev_batch_size 12 --test_batch_size 12 --n_hidden 2048 --epoch -6 --validation_step 1 --early_stop True --earlystop_nsteps 6 --estop_mean_thresh 0.1 --estop_std_thresh 0.1 --dropout_rate 0.1 --learning_rate 0.0001 --report_count 100 --use_seq_length False --export_dir /gdrive/My Drive/exported_models/ --checkpoint_dir /gdrive/My Drive/deepspeech-0.4.1-checkpoint
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use
    tf.py_function, which takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It is easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
    
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py:358: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/rnn/python/ops/lstm_ops.py:696: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
I Restored variables from most recent checkpoint at /gdrive/My Drive/deepspeech-0.4.1-checkpoint/train-14580, step 14580
I STARTING Optimization
I Training epoch 0...
  2% (67 of 2916) |                       | Elapsed Time: 0:01:05 ETA:   0:46:00

Thanks!

We changed the logic to be less confusing when restoring from an existing checkpoint. If you look closely, you'll see that the checkpoint is being loaded correctly:

I Restored variables from most recent checkpoint at /gdrive/My Drive/deepspeech-0.4.1-checkpoint/train-14580, step 14580

Note also that we changed --epoch to --epochs; it's now always relative, so you don't have to use negative numbers.
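For example (a minimal sketch, assuming you want the equivalent of your old --epoch -6, i.e. six more epochs, and keep the rest of your flags unchanged), the only change to the command above would be:

    --epoch -6    becomes    --epochs 6

All other flags, including --checkpoint_dir, can stay as they are; training still picks up from the restored step (14580 in your log), even though the epoch counter starts again at 0, since it is relative now.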


Also, you could have easily answered this question yourself by looking at the commit log: https://github.com/mozilla/DeepSpeech/commits/master