Model doesn't train a second time from the checkpoint

Hello @reuben @kdavis

Hope you are doing well.

I trained my model from the 0.4.1 checkpoint and the result I got was a loss of 11. Then I took another dataset and tried to train further from that checkpoint, but I don't know why it doesn't train: it goes directly to computing the acoustic model predictions and gives the test result, but it is not able to train further. Once I train my model on the released checkpoint, I am unable to train it a second time. Can you tell me what I am doing wrong? Thank you so much in advance.

We can’t do anything unless you share what you are doing and your console output.

Hello @lissyx thank you for your prompt reply.

I am running the same command I ran the first time, when I trained from the 0.4.1 checkpoint, just changing the names and paths of the train, test, and dev files. Apart from that I am doing everything the same.

Here is command

First time, training from the 0.4.1 checkpoint:
python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir /home/sanjjayyp/deepspeech-0.4.1-checkpoint/ --epochs 300 --train_files /home/sanjjayyp/total_data_train.csv --dev_files /home/sanjjayyp/total_data_dev.csv --test_files /home/sanjjayyp/total_data_test.csv --learning_rate 0.0001 --export_dir /home/sanjjayyp/export_total_data_0_4_1/

Second time, training from the 0.4.1 checkpoint:

python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir /home/sanjjayyp/deepspeech-0.4.1-checkpoint/ --epochs 300 --train_files /home/sanjjayyp/petpooja_data_train.csv --dev_files /home/sanjjayyp/petpooja_data_dev.csv --test_files /home/sanjjayyp/petpooja_data_test.csv --learning_rate 0.0001 --export_dir /home/sanjjayyp/export_total_data_0_4_1/

Can't we keep feeding the model from the same checkpoint? It isn't working the second time; it just gives the prediction results directly.

The documentation explains what’s happening here: https://github.com/mozilla/DeepSpeech/tree/v0.4.1#continuing-training-from-a-release-model

You didn't include the console output either. Please format everything using the code formatting options, for readability.

Hey @reuben

I am doing exactly what is mentioned there, but the second time it doesn't train; it just starts computing the acoustic model predictions directly and gives the result.

This is what is happening in my terminal; here is a screenshot.

Please, I asked for text, not a screenshot.

Hey @lissyx, can you please clarify what exactly you want? I already sent you text for the command and also the console output.
I am sorry for my lack of understanding, but I really need to solve this, so can you please tell me what you need in order to solve this error?


Hey @reuben, here is a screenshot for your reference. I am running the same command, but it doesn't train the second time, after I trained successfully on the checkpoint released by the DeepSpeech Mozilla team.

I need the output that you keep posting as a screenshot. Please send it as text, otherwise I can't help.

Hello @lissyx

Here is the output

python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir /home/sanjjayyp/deepspeech-0.4.1-checkpoint/ --epochs -3 --train_files /home/sanjjayyp/petpooja_data_train.csv --dev_files /home/sanjjayyp/petpooja_data_dev.csv --test_files /home/sanjjayyp/petpooja_data_test.csv --learning_rate 0.0001 --export_dir /home/sanjjayyp/export_num_webapp/
WARNING:tensorflow:From /home/sanjjayyp/anaconda3/envs/deepspeechmozilla/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
Preprocessing ['/home/sanjjayyp/petpooja_data_train.csv']
Preprocessing done
Preprocessing ['/home/sanjjayyp/petpooja_data_dev.csv']
Preprocessing done
WARNING:tensorflow:From /home/sanjjayyp/anaconda3/envs/deepspeechmozilla/lib/python3.6/site-packages/tensorflow/contrib/rnn/python/ops/lstm_ops.py:696: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W Parameter --validation_step needs to be > 0 for early stopping to work
WARNING:tensorflow:From /home/sanjjayyp/anaconda3/envs/deepspeechmozilla/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
WARNING:tensorflow:From /home/sanjjayyp/anaconda3/envs/deepspeechmozilla/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
Preprocessing ['/home/sanjjayyp/petpooja_data_test.csv']
Preprocessing done
Computing acoustic model predictions...
100% (182 of 182) |##################################################| Elapsed Time: 0:00:06 Time: 0:00:06
Decoding predictions...
100% (182 of 182) |##################################################| Elapsed Time: 0:00:16 Time: 0:00:16
Test - WER: 0.758242, CER: 0.578792, loss: 15.807545

Hey @reuben, any update on the problem?

You’re passing the wrong flag name; please read the documentation carefully (with python DeepSpeech.py --helpfull). In v0.4.1 the flag was called --epoch, not --epochs. https://github.com/mozilla/DeepSpeech/blob/v0.4.1/util/flags.py#L41

Hey @reuben, the surprising thing is that the first time, when I trained using the --epochs flag, it was able to train. But now, using the same flag, it goes directly to the predictions when continuing from the checkpoint.
I just tried --epoch as well, but it still gives the predictions directly without training on the data. Is there a problem with the checkpoint if we feed it data a second time?

I got it corrected. Using a negative value with --epoch started the training. Thank you so much. But I don't know how it was able to train the first time by passing --epochs without the negative sign.
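For reference, the working continuation command would then look like this under v0.4.1 (same paths as in the console output above; this is a sketch assuming the v0.4.1 flag semantics, where a negative --epoch value requests that many additional epochs on top of the checkpoint):

```shell
python3 DeepSpeech.py --n_hidden 2048 \
  --checkpoint_dir /home/sanjjayyp/deepspeech-0.4.1-checkpoint/ \
  --epoch -3 \
  --train_files /home/sanjjayyp/petpooja_data_train.csv \
  --dev_files /home/sanjjayyp/petpooja_data_dev.csv \
  --test_files /home/sanjjayyp/petpooja_data_test.csv \
  --learning_rate 0.0001 \
  --export_dir /home/sanjjayyp/export_num_webapp/
```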

The flag has a default value.
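For context: in v0.4.1, --epoch is interpreted as a target epoch number, not a count of epochs to run. A minimal sketch of that interpretation (the function name and structure here are illustrative, not DeepSpeech's actual code):

```python
def epochs_to_train(epoch_flag, checkpoint_epoch):
    """Interpret the v0.4.1 --epoch flag.

    A positive value is an absolute target epoch; a negative value
    requests that many *additional* epochs on top of the checkpoint.
    """
    if epoch_flag < 0:
        target = checkpoint_epoch + abs(epoch_flag)
    else:
        target = epoch_flag
    # If the checkpoint is already at or past the target, no training
    # happens and the run skips straight to the test-set evaluation.
    return max(0, target - checkpoint_epoch)

# A fine-tuned checkpoint already past epoch 300:
print(epochs_to_train(300, 412))  # → 0 (skips straight to testing)
print(epochs_to_train(-3, 412))   # → 3 (always trains 3 more epochs)
```

This would be consistent with the behavior in this thread: a positive target can happen to train on the first run but be silently satisfied already on the second, while a negative value always trains.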

I trained on another set of 64k audio files and exported the model, and the result was quite good. Then I trained on 600 audio files from the same checkpoint and exported the model again. The problem I am facing now is that it gives perfect inference for words in the 600 audio files, but no longer gives any correct inference for words from the 64k set. The result I expected was that, in total, it would give me inference for both the 64k and the 600 audio sets, but it only recognizes words from the 600.

How many epochs did you fine-tune for? Why not start with a small number, e.g. one, to try to avoid catastrophic forgetting?

Thank you so much @kdavis.
Training for fewer epochs, like two or three, did the work and avoided catastrophic forgetting.
But to lower the loss I need more epochs, and more epochs lead to forgetting, so I need to maintain that sweet spot, or trade-off, between the two?
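One way to locate that sweet spot (a hypothetical sketch, not a DeepSpeech feature; the helper and the weighting scheme are my own illustration) is to evaluate after each fine-tuning epoch on dev sets drawn from both the old and the new data, and pick the epoch where a weighted combination of the two error rates is lowest:

```python
def pick_best_epoch(old_wer, new_wer, old_weight=0.5):
    """Given per-epoch WER lists on the old-data and new-data dev sets,
    return the epoch index minimizing the weighted combined WER."""
    assert len(old_wer) == len(new_wer)
    combined = [old_weight * o + (1 - old_weight) * n
                for o, n in zip(old_wer, new_wer)]
    return min(range(len(combined)), key=combined.__getitem__)

# Fine-tuning keeps improving the new data, but old-data WER creeps
# up each epoch; the weighted minimum picks the trade-off point.
old = [0.12, 0.15, 0.22, 0.35]   # forgetting grows with epochs
new = [0.60, 0.30, 0.20, 0.18]   # new-data WER keeps improving
print(pick_best_epoch(old, new))  # → 2
```

Raising old_weight preserves more of the original model's behavior; in practice, mixing some of the old training data into the fine-tuning set is another common way to counter forgetting.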