Training does not resume from checkpoint. Always restarts from epoch 0

Hi, I was trying to train my own model on Google Colab.

Everything seemed to be working fine a few days ago, but today, all of a sudden, the training no longer resumes from the checkpoints.

As you can see below, the logs do say that the checkpoint has been restored, but the model starts training again from epoch 0. I also tried using the checkpoint from the 0.4.1 release, but no luck; it always starts back at epoch 0.

Has anything in the code changed over the last 3-4 days? Prior to that, it worked without any problems.

Here is the log from my run.

python -u DeepSpeech.py --train_files /content/SubGen/scripts/train.csv --dev_files /content/SubGen/scripts/dev.csv --test_files /content/SubGen/scripts/val.csv --train_batch_size 12 --dev_batch_size 12 --test_batch_size 12 --n_hidden 2048 --epoch -6 --validation_step 1 --early_stop True --earlystop_nsteps 6 --estop_mean_thresh 0.1 --estop_std_thresh 0.1 --dropout_rate 0.1 --learning_rate 0.0001 --report_count 100 --use_seq_length False --export_dir /gdrive/My Drive/exported_models/ --checkpoint_dir /gdrive/My Drive/deepspeech-0.4.1-checkpoint
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use
    tf.py_function, which takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It is easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
    
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py:358: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/rnn/python/ops/lstm_ops.py:696: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
I Restored variables from most recent checkpoint at /gdrive/My Drive/deepspeech-0.4.1-checkpoint/train-14580, step 14580
I STARTING Optimization
I Training epoch 0...
  2% (67 of 2916) |                       | Elapsed Time: 0:01:05 ETA:   0:46:00

Thanks!

We changed the logic to be less confusing when restoring from an existing checkpoint. If you look closely, you'll see that the checkpoint is being loaded correctly:

I Restored variables from most recent checkpoint at /gdrive/My Drive/deepspeech-0.4.1-checkpoint/train-14580, step 14580

Note also that we changed --epoch to --epochs; it's now always relative, so you don't have to use negative numbers.
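For example (a minimal sketch, assuming you want the equivalent of your old --epoch -6, i.e. six more epochs, and keep the rest of your flags unchanged), the only change to the command above would be:

    --epoch -6    becomes    --epochs 6

All other flags, including --checkpoint_dir, can stay as they are; training still picks up from the restored step (14580 in your log), even though the epoch counter starts again at 0, since it is relative now.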


Also, you could have easily answered this question yourself by looking at the commit log: https://github.com/mozilla/DeepSpeech/commits/master