Continue from the last epoch where training stopped, to keep the training hours

hello,
I have been using DeepSpeech, and it was working correctly.
I put in my data corpus and everything is OK.
The training takes many hours to complete, and an interruption happened on my side (the electricity went off).
I load and save the checkpoints in the checkpoint folder. When I re-ran it and restarted the training, I loaded the previous checkpoint that I had saved; I changed nothing in the data or hyperparameters.
It loaded from the last checkpoint and worked correctly, but it re-counts from the beginning epoch (epoch = 0) and continues.
Is this right?
Is it possible to continue from the last epoch where it stopped, to keep the training hours?

@lissyx
can you comment on this please ?

You posted in the wrong forum. This is for text-to-speech :smile:

As @sanjaesc said, move the topic over to STT!

Sorry to hear that.

Yes, if you provide a checkpoint dir, the training will resume from there.

Yes, this is what you are doing, but if you restart, it restarts counting the epochs from 0. You are still using the current checkpoint, though.

@othiele
thank you
does this mean that I am doing it right?

I did not notice this.
ok, I will move it

Yes, you are doing it right if it says something like “loading checkpoint” in the output.

And thanks for moving the post.

yes, this is what I wanted to say,
but I was asking about why, when I load the checkpoint, the output restarts counting from epoch 0 instead of continuing from the last epoch where it stopped?

Yes, this is OK. You start from epoch 0 + the epochs you already trained. The checkpoint does not store information on epochs or losses.

no, it does not start from 0 + the epochs I already trained
it always goes back and starts from 0, from the beginning again, not from where it had reached before
how can I get it to continue from where it stopped??

We can’t help there if you don’t at least share some information:

  • content of your checkpoint dir
  • command line
  • stdout/stderr

ok

I put four screenshots below.
I want it to continue from epoch 68, where it stopped,
but it restarts again from epoch 0 when I re-run it.
I save the checkpoint in the same folder.
When I re-run again,
I point it at the previous checkpoint folder,
for the same run, without changing anything,
and it restarts from the last best checkpoint it had reached.
When it starts again,
it counts from epoch 0, not from epoch 68.
?? how can I get it to continue from epoch 68

Screenshot from 2020-10-30 08-57-58

???

Let me try it another way:

Training always starts at epoch 0 as the checkpoint does not store which epoch it is in.

If you continue training, it will say epoch 0, but you are in fact training epoch 69.

The checkpoint is just the frozen net in its last trained state. An epoch is just one full round of training with your data.

You can’t see epoch 69 on the screen, you have to write that down somewhere.
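To picture why the counter resets, here is a minimal Python sketch (not DeepSpeech's actual code): the checkpoint stores only the model state, while the epoch counter is an ordinary loop variable that every run re-creates at 0.

```python
def save_checkpoint(weights):
    # Only the model state goes into the checkpoint -- no epoch number.
    return {"weights": weights}

def train(checkpoint=None, epochs=3):
    # Restore weights if a checkpoint is given; the counter is NOT restored.
    weights = checkpoint["weights"] if checkpoint else 0.0
    log = []
    for epoch in range(epochs):   # always counts 0, 1, 2, ...
        weights += 1.0            # stand-in for one full pass over the data
        log.append((epoch, weights))
    return save_checkpoint(weights), log

ckpt, log1 = train()                 # first run: epochs 0-2
ckpt, log2 = train(checkpoint=ckpt)  # resumed run: logs epoch 0 again,
                                     # but the weights carry on from the first run
```

The resumed run logs `(0, 4.0)` as its first entry: epoch 0 on screen, but in fact the fourth pass over the data.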

ok
how can I write this down
to continue
from 69
??

OK, this is my last try to make this clear:

Take a piece of paper, write 68 on it and add the new epochs by writing “+1”.

There is no way to get DeepSpeech to do that for you.
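If pen and paper feels too manual, a tiny helper script can keep that tally in a side file next to your checkpoints. This is only a sketch: the file name `epoch_count.txt` and both functions are made up for illustration, and DeepSpeech itself reads none of it.

```python
import os

TALLY_FILE = "epoch_count.txt"  # hypothetical side file, ignored by DeepSpeech

def read_epochs(checkpoint_dir):
    """Return how many epochs have been trained so far (0 if no tally yet)."""
    path = os.path.join(checkpoint_dir, TALLY_FILE)
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return int(f.read().strip())

def add_epochs(checkpoint_dir, n):
    """Record n more completed epochs and return the new total."""
    total = read_epochs(checkpoint_dir) + n
    with open(os.path.join(checkpoint_dir, TALLY_FILE), "w") as f:
        f.write(str(total))
    return total
```

After each training run, call `add_epochs(checkpoint_dir, epochs_this_run)`; when the next run prints "epoch 0", the real epoch number is `read_epochs(checkpoint_dir)`.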

Seriously, avoid screenshots.

ok
next time I will avoid it
now, for my question:
do you have any help??

No, because I can't read your screenshots.

As I asked, please share full training logs as pure text and text-based listing of your checkpoint directory, including showing the path.

Screenshots: unreadable from email, unable to search for content, limited view of the full data.

If you continue to refuse to share actionable items, we will be unable to help you.

Also, as far as I remember, the epoch value itself is not saved into the checkpoint, so when you reload one it is normal that the next training starts from 0.
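For reference, "share as text" can look like the following shell sketch; the directory and file names here are invented for illustration, so substitute your real checkpoint path.

```shell
# Invented demo directory standing in for a real checkpoint dir:
mkdir -p /tmp/demo-checkpoints
touch /tmp/demo-checkpoints/checkpoint /tmp/demo-checkpoints/best_dev-12345.index

# Text listing of the checkpoint dir, with the path visible:
ls -la /tmp/demo-checkpoints | tee /tmp/checkpoint_listing.txt

# For the training run itself, capture stdout/stderr as plain text, e.g.:
#   python3 DeepSpeech.py --checkpoint_dir /path/to/checkpoints ... 2>&1 | tee train.log
```

Pasting `checkpoint_listing.txt` and `train.log` into the thread gives searchable, readable output instead of screenshots.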