Continue from the last epoch where training stopped, to keep the training hours

hello,
I have been using DeepSpeech, and it was working correctly.
I put in my data corpus and everything is OK.
The training takes many hours to complete, and an interruption happened on my side (the electricity went off).
I load and save the checkpoints in the checkpoint folder. When I re-ran it and restarted the training, I loaded the previous checkpoint that I had saved; I changed nothing in the data or hyperparameters.
It loaded from the last checkpoint and worked correctly, but it re-counts from the beginning epoch (epoch = 0) and continues.
Is this right?
Is it possible to continue from the last epoch where it stopped, to keep the training hours?

@lissyx
can you comment on this please ?

You posted in the wrong forum. This is for text-to-speech :smile:

As @sanjaesc said, move the topic over to STT!

Sorry to hear that.

Yes, if you provide a checkpoint dir, the training will resume from there.

Yes, this is what you are doing, but if you restart, it restarts counting the epochs from 0. You are still using the current checkpoint, though.

@othiele
thank you
does this mean that I am doing it right?

I did not notice this.
ok, I will move it

Yes, you are doing it right if it says something like “loading checkpoint” in the output.

And thanks for moving the post.

yes, this is what I wanted to say,
but I was asking about why, when I load the checkpoint, the output restarts counting from epoch 0 instead of continuing from the last epoch where it stopped?

Yes, this is OK. You start from epoch 0 + the epochs you already trained. The checkpoint does not store information on epochs or losses.

no, it does not start from 0 + the epochs I already trained
it always goes back and starts from 0, from the beginning again, not from where it had reached before
how can I get it to continue from where it stopped??

We can’t help there if you don’t at least share some information:

  • content of your checkpoint dir
  • command line
  • stdout/stderr

ok

I put four screenshots below.
I want it to continue from epoch 68, where it stopped,
but it restarts again from epoch 0 when I re-run it.
I save the checkpoint in the same folder.
When I re-run again,
I point it at the previous checkpoint folder,
for the same run, without changing anything,
and it restarts from the last best checkpoint it had reached.
When it starts again,
it counts from epoch 0, not from epoch 68.
?? how can I get it to continue from epoch 68

Screenshot from 2020-10-30 08-57-58

???

Let me try it another way:

Training always starts at epoch 0 as the checkpoint does not store which epoch it is in.

If you continue training, it will say epoch 0, but you are in fact training epoch 69.

The checkpoint is just the frozen net in its last trained state. An epoch is just one full round of training with your data.

You can’t see epoch 69 on the screen, you have to write that down somewhere.
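To picture why the counter resets, here is a minimal Python sketch (not DeepSpeech's actual code): the checkpoint stores only the model state, while the epoch counter is an ordinary loop variable that every run re-creates at 0.

```python
def save_checkpoint(weights):
    # Only the model state goes into the checkpoint -- no epoch number.
    return {"weights": weights}

def train(checkpoint=None, epochs=3):
    # Restore weights if a checkpoint is given; the counter is NOT restored.
    weights = checkpoint["weights"] if checkpoint else 0.0
    log = []
    for epoch in range(epochs):   # always counts 0, 1, 2, ...
        weights += 1.0            # stand-in for one full pass over the data
        log.append((epoch, weights))
    return save_checkpoint(weights), log

ckpt, log1 = train()                 # first run: epochs 0-2
ckpt, log2 = train(checkpoint=ckpt)  # resumed run: logs epoch 0 again,
                                     # but the weights carry on from the first run
```

The resumed run logs `(0, 4.0)` as its first entry: epoch 0 on screen, but in fact the fourth pass over the data.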

ok
how can I write this down
to continue
from 69
??

OK, this is my last try to make this clear:

Take a piece of paper, write 68 on it and add the new epochs by writing “+1”.

There is no way to get DeepSpeech to do that for you.
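If pen and paper feels too manual, a tiny helper script can keep that tally in a side file next to your checkpoints. This is only a sketch: the file name `epoch_count.txt` and both functions are made up for illustration, and DeepSpeech itself reads none of it.

```python
import os

TALLY_FILE = "epoch_count.txt"  # hypothetical side file, ignored by DeepSpeech

def read_epochs(checkpoint_dir):
    """Return how many epochs have been trained so far (0 if no tally yet)."""
    path = os.path.join(checkpoint_dir, TALLY_FILE)
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return int(f.read().strip())

def add_epochs(checkpoint_dir, n):
    """Record n more completed epochs and return the new total."""
    total = read_epochs(checkpoint_dir) + n
    with open(os.path.join(checkpoint_dir, TALLY_FILE), "w") as f:
        f.write(str(total))
    return total
```

After each training run, call `add_epochs(checkpoint_dir, epochs_this_run)`; when the next run prints "epoch 0", the real epoch number is `read_epochs(checkpoint_dir)`.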

Seriously, avoid screenshots.

ok
next time I will avoid it
now, for my question:
do you have any help??

No, because I can't read your screenshots.

As I asked, please share full training logs as pure text and text-based listing of your checkpoint directory, including showing the path.

Screenshots: unreadable from email, unable to search for content, limited view of the full data.

If you continue to refuse to share actionable items, we will be unable to help you.

Also, as far as I remember, the epoch value itself is not saved into the checkpoint, so when you reload one it is normal that the next training starts from 0.
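For reference, "share as text" can look like the following shell sketch; the directory and file names here are invented for illustration, so substitute your real checkpoint path.

```shell
# Invented demo directory standing in for a real checkpoint dir:
mkdir -p /tmp/demo-checkpoints
touch /tmp/demo-checkpoints/checkpoint /tmp/demo-checkpoints/best_dev-12345.index

# Text listing of the checkpoint dir, with the path visible:
ls -la /tmp/demo-checkpoints | tee /tmp/checkpoint_listing.txt

# For the training run itself, capture stdout/stderr as plain text, e.g.:
#   python3 DeepSpeech.py --checkpoint_dir /path/to/checkpoints ... 2>&1 | tee train.log
```

Pasting `checkpoint_listing.txt` and `train.log` into the thread gives searchable, readable output instead of screenshots.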