Resume fine-tuning a model

Hi guys,

I’m running the command below to fine-tune a model. However, I had a power failure and lost the running process.

The latest log I have shows that it stopped at epoch 2 (please check the log below).

My question is: Is there any way to resume from this point? What’s the correct way?

Is --load_checkpoint_dir the parameter for that? Or should I have trained with --save_checkpoint_dir or something before starting?

I’m using DeepSpeech v0.9.1.

Thank you

python3 DeepSpeech.py --n_hidden 2048 \
  --checkpoint_dir fine_tuning_checkpoints/ \
  --epochs 3 \
  --train_files /data/librivox-train-clean-100.csv \
  --dev_files /data/librivox-dev-clean.csv \
  --test_files /data/librivox-test-clean.csv \
  --learning_rate 0.0001 \
  --export_dir output_models/ \
  --train_cudnn

I1116 21:34:23.325432 140401849284480 utils.py:141] NumExpr defaulting to 2 threads.
I Loading best validating checkpoint from fine_tuning_checkpoints/best_dev-1466475
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 3:36:24 | Steps: 28539 | Loss: 21.645701
Epoch 0 | Validation | Elapsed Time: 0:13:15 | Steps: 2703 | Loss: 15.384955 | Dataset: /data/librivox-dev-clean.csv
I Saved new best validating model with loss 15.384955 to: fine_tuning_checkpoints/best_dev-1495014

Epoch 1 | Training | Elapsed Time: 3:37:58 | Steps: 28539 | Loss: 19.061899
Epoch 1 | Validation | Elapsed Time: 0:09:12 | Steps: 2703 | Loss: 15.219610 | Dataset: /data/librivox-dev-clean.csv
I Saved new best validating model with loss 15.219610 to: fine_tuning_checkpoints/best_dev-1523553

Epoch 2 | Training | Elapsed Time: 0:08:18 | Steps: 2303 | Loss: 6.402097

I know TTS and STT are confusing, but please post this over in STT. This forum is about generating voice.

Hi Othiele,

I’m really sorry. I’ll post there.

Thank you

Look in this directory; it has the checkpoints that you can use to resume training from. To do that, use the same command as before with

--checkpoint_dir fine_tuning_checkpoints/

But it will start counting again at epoch 0, so just add that to the epochs you already trained. And use a dropout of about 0.3 or 0.4.

Sorry. I didn’t understand how to prevent it from starting at epoch 0.

Do you mean this?

--checkpoint_dir fine_tuning_checkpoints/
--epochs 3 \ # Not so sure about that
--augment dropout[p=0.1,rate=0.03]

Sorry, I didn’t write that clearly enough. There is no info on the number of epochs in the checkpoint, so counting starts at 0 even though you already did 2. Set epochs to 20 or whatever value you think matches your hardware.

And use dropout_rate from flags.py.
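
For example, instead of the --augment dropout[...] line, something like this (the values are just a rough starting point, not an official recommendation):

--epochs 20
--dropout_rate 0.3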

Hi Othiele, thank you for helping!

Regarding flags.py, if I set --load_train "last", does that mean it loads the most recent epoch checkpoint?

f.DEFINE_string('load_train', 'auto', 'what checkpoint to load before starting the training process. "last" for loading most recent epoch checkpoint, "best" for loading best validation loss checkpoint, "init" for initializing a new checkpoint, "auto" for trying several options.')

Which value of dropout_rate do you recommend? flags.py has the following:

f.DEFINE_float('dropout_rate', 0.05, 'dropout rate for feedforward layers')
f.DEFINE_float('dropout_rate2', -1.0, 'dropout rate for layer 2 - defaults to dropout_rate')
f.DEFINE_float('dropout_rate3', -1.0, 'dropout rate for layer 3 - defaults to dropout_rate')
f.DEFINE_float('dropout_rate4', 0.0, 'dropout rate for layer 4 - defaults to 0.0')
f.DEFINE_float('dropout_rate5', 0.0, 'dropout rate for layer 5 - defaults to 0.0')
f.DEFINE_float('dropout_rate6', -1.0, 'dropout rate for layer 6 - defaults to dropout_rate')

The best and last checkpoints are defined in the checkpoint text file. It depends on where you want to restart training; in your case, probably the last one.

For beginners, just set dropout_rate to something like 0.3 or 0.4. Leave the others at their defaults.
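
Putting it together, the resume command could look roughly like your original one, e.g. (just a sketch; adjust the epoch count, dropout and paths to your setup):

# resume from the most recent checkpoint in fine_tuning_checkpoints/
python3 DeepSpeech.py --n_hidden 2048 \
  --checkpoint_dir fine_tuning_checkpoints/ \
  --load_train last \
  --epochs 20 \
  --dropout_rate 0.3 \
  --train_files /data/librivox-train-clean-100.csv \
  --dev_files /data/librivox-dev-clean.csv \
  --test_files /data/librivox-test-clean.csv \
  --learning_rate 0.0001 \
  --export_dir output_models/ \
  --train_cudnn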

Thanks. I was able to resume and now it’s running.

So, even though it’s showing epoch 0, that doesn’t mean the process started from the beginning, is that right?

The current output is below:

I Loading best validating checkpoint from fine_tuning_checkpoints/best_dev-1523553
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:18:44 | Steps: 4403 | Loss: 22.183089

Yes, exactly. You always continue training from a checkpoint. Without one you would start over.

Perfect Olaf.

Thank you for your help!