Model doesn't train a second time from the checkpoint

Hello @reuben @kdavis

Hope you are doing well.

I trained my model from the 0.4.1 checkpoint and the result I got was a loss of 11. Then I took another dataset and tried to train further from that checkpoint, but I don't know why it doesn't train: it goes directly to computing the acoustic model predictions and gives the test result, but it is not able to train further. Once I train my model on the released checkpoint, I am unable to train it a second time. Can you tell me what I am doing wrong? Thank you so much in advance.

We can’t do anything unless you share what you are doing and your console output.

Hello @lissyx thank you for your prompt reply.

I am running the same command I ran the first time, when I trained from the 0.4.1 checkpoint, just changing the names and paths of the train, test, and dev files. Apart from that I am doing everything the same.

Here is command

First time, training from the 0.4.1 checkpoint:
python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir /home/sanjjayyp/deepspeech-0.4.1-checkpoint/ --epochs 300 --train_files /home/sanjjayyp/total_data_train.csv --dev_files /home/sanjjayyp/total_data_dev.csv --test_files /home/sanjjayyp/total_data_test.csv --learning_rate 0.0001 --export_dir /home/sanjjayyp/export_total_data_0_4_1/

Second time, training from the 0.4.1 checkpoint:

python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir /home/sanjjayyp/deepspeech-0.4.1-checkpoint/ --epochs 300 --train_files /home/sanjjayyp/petpooja_data_train.csv --dev_files /home/sanjjayyp/petpooja_data_dev.csv --test_files /home/sanjjayyp/petpooja_data_test.csv --learning_rate 0.0001 --export_dir /home/sanjjayyp/export_total_data_0_4_1/

Can't we keep feeding the model from the same checkpoint? It isn't working the second time; it just gives the prediction results directly.

The documentation explains what’s happening here: https://github.com/mozilla/DeepSpeech/tree/v0.4.1#continuing-training-from-a-release-model

You didn't include the console output either. Please format everything using the code formatting options, for readability.

Hey @reuben

I am doing exactly what is mentioned there, but the second time it doesn't train; it just starts computing the acoustic model predictions directly and gives the result.

This is what is happening in my terminal; here is a screenshot.

Please, I asked for text, not a screenshot.

Hey @lissyx, can you please clarify what exactly you want? I already sent you text for the command and also the console output.
I am sorry for my lack of understanding, but I really need to solve this, so can you please tell me what you need in order to solve this error?


Hey @reuben, here is a screenshot for your reference. I am running the same command, but it doesn't train the second time, after I trained successfully on the checkpoint released by the DeepSpeech Mozilla team.

I need the output that you keep posting as a screenshot. Please send it as text, otherwise I can't help.

Hello @lissyx

Here is the output

python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir /home/sanjjayyp/deepspeech-0.4.1-checkpoint/ --epochs -3 --train_files /home/sanjjayyp/petpooja_data_train.csv --dev_files /home/sanjjayyp/petpooja_data_dev.csv --test_files /home/sanjjayyp/petpooja_data_test.csv --learning_rate 0.0001 --export_dir /home/sanjjayyp/export_num_webapp/
WARNING:tensorflow:From /home/sanjjayyp/anaconda3/envs/deepspeechmozilla/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
Preprocessing ['/home/sanjjayyp/petpooja_data_train.csv']
Preprocessing done
Preprocessing ['/home/sanjjayyp/petpooja_data_dev.csv']
Preprocessing done
WARNING:tensorflow:From /home/sanjjayyp/anaconda3/envs/deepspeechmozilla/lib/python3.6/site-packages/tensorflow/contrib/rnn/python/ops/lstm_ops.py:696: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W Parameter --validation_step needs to be > 0 for early stopping to work
WARNING:tensorflow:From /home/sanjjayyp/anaconda3/envs/deepspeechmozilla/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
WARNING:tensorflow:From /home/sanjjayyp/anaconda3/envs/deepspeechmozilla/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
Preprocessing ['/home/sanjjayyp/petpooja_data_test.csv']
Preprocessing done
Computing acoustic model predictions...
100% (182 of 182) |##################################################| Elapsed Time: 0:00:06 Time: 0:00:06
Decoding predictions...
100% (182 of 182) |##################################################| Elapsed Time: 0:00:16 Time: 0:00:16
Test - WER: 0.758242, CER: 0.578792, loss: 15.807545

Hey @reuben, any update on the problem?

You’re passing the wrong flag name; please read the documentation carefully (with python DeepSpeech.py --helpfull). In v0.4.1 the flag was called --epoch, not --epochs. https://github.com/mozilla/DeepSpeech/blob/v0.4.1/util/flags.py#L41

Hey @reuben, the surprising thing is that the first time, when I trained using the --epochs flag, it was able to train. But now, using the same flag, it goes directly to the predictions when continuing from the checkpoint.
I just tried --epoch as well, but it still gives the predictions directly without training on the data. Is there a problem with the checkpoint if we feed it data a second time?

I got it corrected. Using a negative value with --epoch started the training. Thank you so much. But I don't know how it was able to train the first time by passing --epochs without the negative sign.
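For reference, the working continuation command would then look like this under v0.4.1 (same paths as in the console output above; this is a sketch assuming the v0.4.1 flag semantics, where a negative --epoch value requests that many additional epochs on top of the checkpoint):

```shell
python3 DeepSpeech.py --n_hidden 2048 \
  --checkpoint_dir /home/sanjjayyp/deepspeech-0.4.1-checkpoint/ \
  --epoch -3 \
  --train_files /home/sanjjayyp/petpooja_data_train.csv \
  --dev_files /home/sanjjayyp/petpooja_data_dev.csv \
  --test_files /home/sanjjayyp/petpooja_data_test.csv \
  --learning_rate 0.0001 \
  --export_dir /home/sanjjayyp/export_num_webapp/
```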

The flag has a default value.
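For context: in v0.4.1, --epoch is interpreted as a target epoch number, not a count of epochs to run. A minimal sketch of that interpretation (the function name and structure here are illustrative, not DeepSpeech's actual code):

```python
def epochs_to_train(epoch_flag, checkpoint_epoch):
    """Interpret the v0.4.1 --epoch flag.

    A positive value is an absolute target epoch; a negative value
    requests that many *additional* epochs on top of the checkpoint.
    """
    if epoch_flag < 0:
        target = checkpoint_epoch + abs(epoch_flag)
    else:
        target = epoch_flag
    # If the checkpoint is already at or past the target, no training
    # happens and the run skips straight to the test-set evaluation.
    return max(0, target - checkpoint_epoch)

# A fine-tuned checkpoint already past epoch 300:
print(epochs_to_train(300, 412))  # → 0 (skips straight to testing)
print(epochs_to_train(-3, 412))   # → 3 (always trains 3 more epochs)
```

This would be consistent with the behavior in this thread: a positive target can happen to train on the first run but be silently satisfied already on the second, while a negative value always trains.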

I trained on another set of 64k audio files and exported the model, and the result was quite good. Then I trained on 600 audio files from the same checkpoint and exported the model again. The problem I am facing now is that it gives perfect inference for words in the 600 audio files, but no longer gives any correct inference for words from the 64k set. The result I expected was that, in total, it would give me inference for both the 64k and the 600 audio sets, but it only recognizes words from the 600.

How many epochs did you fine-tune for? Why not start with a small number, e.g. one, to try to avoid catastrophic forgetting?

Thank you so much @kdavis.
Training for fewer epochs, like two or three, did the work and avoided catastrophic forgetting.
But to lower the loss I need more epochs, and more epochs lead to forgetting, so I need to maintain that sweet spot, or trade-off, between the two?
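One way to locate that sweet spot (a hypothetical sketch, not a DeepSpeech feature; the helper and the weighting scheme are my own illustration) is to evaluate after each fine-tuning epoch on dev sets drawn from both the old and the new data, and pick the epoch where a weighted combination of the two error rates is lowest:

```python
def pick_best_epoch(old_wer, new_wer, old_weight=0.5):
    """Given per-epoch WER lists on the old-data and new-data dev sets,
    return the epoch index minimizing the weighted combined WER."""
    assert len(old_wer) == len(new_wer)
    combined = [old_weight * o + (1 - old_weight) * n
                for o, n in zip(old_wer, new_wer)]
    return min(range(len(combined)), key=combined.__getitem__)

# Fine-tuning keeps improving the new data, but old-data WER creeps
# up each epoch; the weighted minimum picks the trade-off point.
old = [0.12, 0.15, 0.22, 0.35]   # forgetting grows with epochs
new = [0.60, 0.30, 0.20, 0.18]   # new-data WER keeps improving
print(pick_best_epoch(old, new))  # → 2
```

Raising old_weight preserves more of the original model's behavior; in practice, mixing some of the old training data into the fine-tuning set is another common way to counter forgetting.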