Unable to load steps from checkpoint when training model

Hello all, my dataset for the DeepSpeech model consists of 28 MB WAV audio files around 15 minutes in length. They are all recordings of telephone calls, and I would like to use DeepSpeech to create transcripts.

Train set: 1681 files
Dev set: 219 files
Test set: 209 files

I have followed the instructions and started training a model of my own, passing in the following flags for training:

python3 DeepSpeech.py \
  --train_files /home/stieon/malay_dataset/converted_train_audio/train.csv \
  --dev_files /home/stieon/malay_dataset/converted_dev_audio/dev.csv \
  --test_files /home/stieon/malay_dataset/converted_test_audio/test.csv \
  --epochs 35 \
  --checkpoint_dir /home/stieon/checkpoint/ \
  --export_dir /home/stieon/model_dir \
  --train_batch_size 1 \
  --dev_batch_size 1 \
  --test_batch_size 1 \
  --automatic_mixed_precision \
  --learning_rate 0.0001 \
  --scorer /home/stieon/malay_dataset/malay.scorer \
  --train_cudnn true

I have set the train, dev, and test batch sizes all to 1 because setting them to 2 or above results in OOM errors (I'm assuming this is due to the long audio recordings). This makes my training time extremely long, even though I am using an NVIDIA Tesla T4 GPU, but that is not much of an issue.

However, when resuming training from checkpoints, I noticed that it starts from step 0 every time I start training. This is frustrating because I have to wait for all 1681 steps to finish, each step takes around 2 minutes, and that is only for one epoch.

This is what I got when loading checkpoints:

I Enabling automatic mixed precision training.
I Could not find best validating checkpoint.
I Loading most recent checkpoint from /home/stieon/checkpoint/train-1007
I Loading variable from checkpoint: cond_1/beta1_power
I Loading variable from checkpoint: cond_1/beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: current_loss_scale
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: good_steps
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I STARTING Optimization

Any help on how I can continue training from the particular step I stopped at would be greatly appreciated. Thank you!

Are you working on an English dataset?
I am a newbie just like you; may I know the source of your dataset?

Also, as far as I can tell, you didn't properly specify the checkpoint you want to load.

Can you also specify which checkpoint you are using?

Hi yes, the dataset is from telephone calls, with transcripts in the following format:

[0.000]

[6.385]
helo
[7.821]
this is a test
…
…
…

so on and so forth.

I have preprocessed them to contain only the text.

I did specify the checkpoint directory from which to load the checkpoint files.

According to the docs:
“The purpose of checkpoints is to allow interruption (also in the case of some unexpected failure) and later continuation of training without losing hours of training time. Resuming from checkpoints happens automatically by just (re)starting training with the same --checkpoint_dir of the former run.”

I did that, but it doesn't seem like training is resuming from the particular step I left off at.

I meant --checkpoint_dir /home/stieon/checkpoint/deepspeech-0.6.1-checkpoint

or any other version

Have you made these datasets yourself, or downloaded them from a site?

If from a site:
please mention the site.

If yourself:
please guide me a little on how to create a dataset.

Your code is running fine; it's just that you didn't specify the checkpoint path properly.

Since you didn't share your training log, we can't check. Checkpoints are saved on a regular basis; are you sure one has been produced?

Hi yes, I am using DeepSpeech 0.7.1 and the checkpoint directory is this:
--checkpoint_dir /home/stieon/checkpoint/

Not sure if I answered your question correctly. It seems to be loading the checkpoint correctly, based on this output:

I Loading most recent checkpoint from /home/stieon/checkpoint/train-1007

It's just that it starts from step 0 again after loading.

As for the dataset, I got it from my company, so I did the preprocessing myself.

What I did:

  1. Combined each entire txt file into a single line.
  2. Removed all brackets and the strings within them.
  3. This resulted in each transcript being a single line of characters, with multiple spaces between the sentences. (A rough sketch of this cleanup is shown below.)
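Something like this is roughly what I did (a minimal sketch; the file name is hypothetical, and here the repeated spaces are collapsed into single spaces):

import re

def clean_transcript(path):
    # Read the raw transcript, drop bracketed timestamps such as [6.385],
    # then join everything into a single line.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    text = re.sub(r"\[[^\]]*\]", " ", text)  # remove brackets and their contents
    return " ".join(text.split())            # single line, single spaces

print(clean_transcript("call_0001.txt"))  # hypothetical file name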

Checking the directory, I see the following in the checkpoint file:

model_checkpoint_path: "/home/stieon/checkpoint/train-1237"
all_model_checkpoint_paths: "/home/stieon/checkpoint/train-1203"
all_model_checkpoint_paths: "/home/stieon/checkpoint/train-1212"
all_model_checkpoint_paths: "/home/stieon/checkpoint/train-1221"
all_model_checkpoint_paths: "/home/stieon/checkpoint/train-1228"
all_model_checkpoint_paths: "/home/stieon/checkpoint/train-1237"

Seems to me like the checkpoints are saved correctly.

Any guidance on how I can get the training log?

The training log is on standard output…

Yes, so here is my full training log. The model starts training from step 0 even though I have previously trained it to step 200+. After running the above command, I get this output and the model trains from scratch.

I Enabling automatic mixed precision training.
I Could not find best validating checkpoint.
I Loading most recent checkpoint from /home/stieon/checkpoint/train-1245
I Loading variable from checkpoint: cond_1/beta1_power
I Loading variable from checkpoint: cond_1/beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: current_loss_scale
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: good_steps
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000

Can you give some stats on the individual audio sizes? Typically they should be 5-15 seconds max.
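If it helps, a quick way to gather those stats is something along these lines (a minimal sketch using Python's standard wave module; the directory and glob pattern are assumptions, so adjust them to wherever your converted WAV files live):

import glob
import wave

durations = []
for path in glob.glob("/home/stieon/malay_dataset/converted_train_audio/*.wav"):  # assumed location
    with wave.open(path, "rb") as w:
        durations.append(w.getnframes() / w.getframerate())

if durations:  # avoid errors if the glob matched nothing
    print("files:", len(durations))
    print("min / mean / max seconds:",
          min(durations), sum(durations) / len(durations), max(durations))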

The audio files are about 28 MB each after converting them into the format DeepSpeech recognises, and are about 15 minutes long with silences in between.

I understand that the audio files should be kept short, but I was trying to see whether DeepSpeech can process these longer audio files.

So it loaded from checkpoint … I don’t get your problem?

The issue is that while the training log says it has loaded from the checkpoint:
I Loading most recent checkpoint from /home/stieon/checkpoint/train-1245

the training progress line starts from step 0 again.

This is the line:
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000

And the model starts training from step 0 instead of the expected step 1245. I assume it will have to reach 1681 steps to finish epoch 0, and if it is starting from step 0 again, doesn't that mean the previous training time was wasted?

Or does the model continue from step 0 but only have to complete the remaining 1681 - 1245 = 436 steps?

Please correct me if I'm wrong about this. Thanks to all those who've helped so far, cheers!

This whole thread is based on this assumption, which you could have just checked… The training logs always report per-epoch relative steps, not the global step.

The fact that the log is explicitly saying “Loaded variables from checkpoint”, and the fact that the checkpoint dir contains checkpoints with a global step value > 0, show that it is loading and saving the checkpoints properly.
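If you want to double-check, you can read the global step stored in a checkpoint directly, e.g. with a sketch along these lines (using the TensorFlow checkpoint reader; point it at the checkpoint prefix reported in your log):

import tensorflow as tf

# Load the checkpoint prefix that the training log reported.
reader = tf.train.load_checkpoint("/home/stieon/checkpoint/train-1245")
print("global_step stored in checkpoint:", reader.get_tensor("global_step"))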

The training code does not resume partially completed epochs; it always runs complete epochs, as described in the documentation for the --epochs flag.

Thanks for the clarification.

Is there any way to resume partially completed epochs? One epoch on my dataset takes a very long time to train.

Not currently. I would gladly review and help with a pull request if you're interested, though.