Resume fine-tuning a model

Hi guys,

I’m running the command below to fine-tune a model. However, I had a power failure and lost the running process.

The latest log I have shows that it stopped at epoch 2 (please check the log below).

My question is: Is there any way to resume from this point? What’s the correct way?

Is --load_checkpoint_dir the parameter for that? Or should I have trained with --save_checkpoint_dir or something before starting?

I’m using DeepSpeech v0.9.1.

Thank you

python3 DeepSpeech.py --n_hidden 2048 \
  --checkpoint_dir fine_tuning_checkpoints/ \
  --epochs 3 \
  --train_files /data/librivox-train-clean-100.csv \
  --dev_files /data/librivox-dev-clean.csv \
  --test_files /data/librivox-test-clean.csv \
  --learning_rate 0.0001 \
  --export_dir output_models/ \
  --train_cudnn

I1116 21:34:23.325432 140401849284480 utils.py:141] NumExpr defaulting to 2 threads.
I Loading best validating checkpoint from fine_tuning_checkpoints/best_dev-1466475
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 3:36:24 | Steps: 28539 | Loss: 21.645701
Epoch 0 | Validation | Elapsed Time: 0:13:15 | Steps: 2703 | Loss: 15.384955 | Dataset: /data/librivox-dev-clean.csv
I Saved new best validating model with loss 15.384955 to: fine_tuning_checkpoints/best_dev-1495014

Epoch 1 | Training | Elapsed Time: 3:37:58 | Steps: 28539 | Loss: 19.061899
Epoch 1 | Validation | Elapsed Time: 0:09:12 | Steps: 2703 | Loss: 15.219610 | Dataset: /data/librivox-dev-clean.csv
I Saved new best validating model with loss 15.219610 to: fine_tuning_checkpoints/best_dev-1523553

Epoch 2 | Training | Elapsed Time: 0:08:18 | Steps: 2303 | Loss: 6.402097

I know TTS and STT are confusing, but please post this over in STT. This forum is about generating voice.

Hi Othiele,

I’m really sorry. I’ll post there.

Thank you

Look in this directory; it has the checkpoints that you can use to resume training from. To do that, use the same command as before with

--checkpoint_dir fine_tuning_checkpoints/

But it will start counting again at epoch 0, so just add that to the epochs you already trained. And use a dropout of about 0.3 or 0.4.

Sorry. I didn’t understand how to prevent it from starting at epoch 0.

Do you mean this?

--checkpoint_dir fine_tuning_checkpoints/
--epochs 3 \ # Not so sure about that
--augment dropout[p=0.1,rate=0.03]

Sorry, I didn’t write that clearly enough. There is no info on the number of epochs in the checkpoint, so counting starts at 0 even though you already did 2. Set epochs to 20 or whatever value you think matches your hardware.

And use dropout_rate from flags.py.
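
For example, instead of the --augment dropout[...] line, something like this (the values are just a rough starting point, not an official recommendation):

--epochs 20
--dropout_rate 0.3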

Hi Othiele, thank you for helping!

Regarding flags.py, if I set --load_train "last", does that mean it loads the most recent epoch checkpoint?

f.DEFINE_string('load_train', 'auto', 'what checkpoint to load before starting the training process. "last" for loading most recent epoch checkpoint, "best" for loading best validation loss checkpoint, "init" for initializing a new checkpoint, "auto" for trying several options.')

Which value of dropout_rate do you recommend? flags.py has the following:

f.DEFINE_float('dropout_rate', 0.05, 'dropout rate for feedforward layers')
f.DEFINE_float('dropout_rate2', -1.0, 'dropout rate for layer 2 - defaults to dropout_rate')
f.DEFINE_float('dropout_rate3', -1.0, 'dropout rate for layer 3 - defaults to dropout_rate')
f.DEFINE_float('dropout_rate4', 0.0, 'dropout rate for layer 4 - defaults to 0.0')
f.DEFINE_float('dropout_rate5', 0.0, 'dropout rate for layer 5 - defaults to 0.0')
f.DEFINE_float('dropout_rate6', -1.0, 'dropout rate for layer 6 - defaults to dropout_rate')

The best and last checkpoints are defined in the checkpoint text file. It depends on where you want to restart training; in your case, probably the last one.

For beginners, just set dropout_rate to something like 0.3 or 0.4. Leave the others at their defaults.
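
Putting it together, the resume command could look roughly like your original one, e.g. (just a sketch; adjust the epoch count, dropout and paths to your setup):

# resume from the most recent checkpoint in fine_tuning_checkpoints/
python3 DeepSpeech.py --n_hidden 2048 \
  --checkpoint_dir fine_tuning_checkpoints/ \
  --load_train last \
  --epochs 20 \
  --dropout_rate 0.3 \
  --train_files /data/librivox-train-clean-100.csv \
  --dev_files /data/librivox-dev-clean.csv \
  --test_files /data/librivox-test-clean.csv \
  --learning_rate 0.0001 \
  --export_dir output_models/ \
  --train_cudnn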

Thanks. I was able to resume and now it’s running.

So, even though it’s showing epoch 0, that doesn’t mean the process started from the beginning, is that right?

The current output is below:

I Loading best validating checkpoint from fine_tuning_checkpoints/best_dev-1523553
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:18:44 | Steps: 4403 | Loss: 22.183089

Yes, exactly. You always continue training from a checkpoint. Without one you would start over.

Perfect Olaf.

Thank you for your help!