What are the essential checkpoint files to resume training a DeepSpeech model?

Hi Everyone,

If I want to continue training a model and want to restore the checkpoints from a previously backed up location, do I have to recopy ALL the files, or will a smaller subset do the job? Which ones are required, if not all of them?

I am training my model (using version 0.5.1) in small batches of input data due to OOM issues. I have already trained on the Mozilla Voice corpus. Using the checkpoints directory, I want to continue training on content extracted from certain PDF files. I will be doing this one file at a time. As I encounter OOM failures, I keep backups of the checkpoints directory, restore it to the working checkpoint directory and resume training. But this backup checkpoints directory grows in size over time.

Now I am facing problems as it is hitting sizes of 30 GB and over.

For example, my backup checkpoint directory contains the following files:
checkpoint
best_dev_checkpoint
best_dev-143.index
best_dev-143.meta
best_dev-143.data-00000-of-00001
train-99.index
train-99.meta
train-99.data-00000-of-00001
train-110.index
train-110.meta
train-110.data-00000-of-00001
train-121.index
train-121.meta
train-121.data-00000-of-00001
train-132.index
train-132.meta
train-132.data-00000-of-00001
train-143.index
train-143.meta
train-143.data-00000-of-00001

Which files should I copy into the working checkpoint directory to resume training successfully? Is there any reason to retain the rest of the files?
Hope you can guide me on this.
Regards,
Rohit

This is all handled by TensorFlow itself. If you are limited in space, you can look into the checkpoint file and see the current one used. That would be enough. But yeah, TensorFlow does keep a few of them around.
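
For example, something along these lines will print which checkpoint the directory currently points to. A rough sketch only; it assumes TensorFlow 1.x (which DeepSpeech 0.5.x uses), and the directory path is a placeholder you would replace with your own:

```python
# Rough sketch: ask TensorFlow which checkpoint the directory currently points to.
import tensorflow as tf

ckpt_dir = "checkpointDir"  # your checkpoint directory

state = tf.train.get_checkpoint_state(ckpt_dir)
if state:
    # The prefix training will resume from, e.g. ".../train-143"
    print("current checkpoint prefix:", state.model_checkpoint_path)
    # Older checkpoints TensorFlow is still tracking
    for path in state.all_model_checkpoint_paths:
        print("still tracked:", path)
```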

Thank you for the input @lissyx. Based on your answer, is this how I should proceed?

My “checkpoint” file is as follows:

model_checkpoint_path: "/home/rohit/dpspTraining/models/v051/model2-domainSet_1_20-260total/checkpointDir/train-143"
all_model_checkpoint_paths: "/home/rohit/dpspTraining/models/v051/model2-domainSet_1_20-260total/checkpointDir/train-99"
all_model_checkpoint_paths: "/home/rohit/dpspTraining/models/v051/model2-domainSet_1_20-260total/checkpointDir/train-110"
all_model_checkpoint_paths: "/home/rohit/dpspTraining/models/v051/model2-domainSet_1_20-260total/checkpointDir/train-121"
all_model_checkpoint_paths: "/home/rohit/dpspTraining/models/v051/model2-domainSet_1_20-260total/checkpointDir/train-132"
all_model_checkpoint_paths: "/home/rohit/dpspTraining/models/v051/model2-domainSet_1_20-260total/checkpointDir/train-143"

The “best_dev_checkpoint” contents are as follows:

model_checkpoint_path: "/home/rohit/dpspTraining/models/v051/model2-domainSet_1_20-260total/checkpointDir/best_dev-143"
all_model_checkpoint_paths: "/home/rohit/dpspTraining/models/v051/model2-domainSet_1_20-260total/checkpointDir/best_dev-143"

Does this mean I only need to keep:
Option 1) All the train-XXX numbered files listed in the “checkpoint” file?
or
Option 2) Since “best_dev_checkpoint” only mentions number 143, can I delete all the train-XXX.index / .meta / .data-00000-of-00001 files where XXX is anything other than 143?
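
If Option 2 is the way to go, this is roughly what I have in mind (a rough sketch only; the destination path is just an example, and the prefixes come from my two pointer files above):

```python
# Rough sketch of Option 2: copy only the files belonging to the current
# "train" and "best_dev" checkpoints, plus the two pointer files themselves.
import os
import shutil

src = "/home/rohit/dpspTraining/models/v051/model2-domainSet_1_20-260total/checkpointDir"
dst = "/home/rohit/dpspTraining/backups/checkpointDir-pruned"  # hypothetical backup location
os.makedirs(dst, exist_ok=True)

keep_prefixes = ("train-143", "best_dev-143")        # from the two pointer files above
keep_exact = ("checkpoint", "best_dev_checkpoint")   # the pointer files themselves

for name in os.listdir(src):
    if name in keep_exact or any(name.startswith(p + ".") for p in keep_prefixes):
        shutil.copy2(os.path.join(src, name), dst)
        print("kept:", name)
```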

Regards,
Rohit

I would think so. But honestly, that looks like a lot of workaround. I’m not sure I understand why you have to do all of that.

Thank you @lissyx. Will try option 2 then.

The reason for all this is:
The above is just an example with very few files. The real training checkpoint directory already has around 140 files totaling around 28 GB. And with OOM issues, I need to divide the new data into smaller chunks, which creates even more checkpoints and increases the size even more. So backing up only the checkpoint files that are actually needed will save space for me. Will let you know how this works out. Thanks again!

That does not look normal. Are those stale checkpoints from previous runs? Maybe you should just clean everything up after an OOM.

You should just run experiments, doing a binary search on batch size, to find the maximum value that does not OOM. Then nuke the stale checkpoints.
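
Something along these lines, for example. This is only a sketch: it assumes a run that hits OOM makes DeepSpeech.py exit with a non-zero code, the flag names are from memory (0.5.x era) and should be checked against your DeepSpeech.py help output, and the CSV paths and probe checkpoint directory are placeholders:

```python
# Rough sketch: binary search for the largest --train_batch_size that does not OOM.
import subprocess

def trains_ok(batch_size):
    cmd = [
        "python", "DeepSpeech.py",
        "--train_files", "train.csv",
        "--dev_files", "dev.csv",
        "--test_files", "test.csv",
        "--train_batch_size", str(batch_size),
        "--dev_batch_size", str(max(1, batch_size // 2)),
        "--test_batch_size", str(max(1, batch_size // 2)),
        "--checkpoint_dir", "/tmp/ckpt_probe",  # throwaway dir so probes do not touch real checkpoints
        # You would also limit each probe to a single epoch; the flag is
        # --epoch or --epochs depending on the DeepSpeech version.
    ]
    return subprocess.run(cmd).returncode == 0

lo, hi, best = 1, 64, 0
while lo <= hi:
    mid = (lo + hi) // 2
    if trains_ok(mid):
        best, lo = mid, mid + 1   # fits in memory: try larger
    else:
        hi = mid - 1              # OOM: try smaller
print("largest batch size that did not OOM:", best)
```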

I started with the contents of the validation.tsv file of the Mozilla Voice corpus, using the first 550k entries. I found a sweet spot in my config where OOM did not occur: input data broken into chunks of 60k data points, train/dev/test batch sizes = 32/16/16, n_hidden = 2048.

Not knowing what to retain, I kept all the checkpoint files: 110 files, approx. 20 GB.

Let’s call the above my BASELINE model.

Then, for domain training, I took my first PDF file’s contents = 13,000 data points. I expected it to train without a hitch, as the input data size is much smaller than the 60k training chunks. But with the 32/16/16 batch sizes I got an OOM problem.

So I divided it into smaller chunks of approx. 3,600 data points with batch sizes = 16/8/8. That worked, but the checkpoint directory had grown to 140 files = 30 GB by the time I was done with all 13,000 data points.

Then I changed approach: I restored the checkpoint directory to the baseline model and gave it the entire 13k data points with batch sizes = 8/4/4. The training was much slower, but there was no OOM problem. At this stage the checkpoint directory has around 120 files = 22 GB, which is much better than 140 files = 30 GB!

Now I need to move to the next PDF file, which should create another 15,000 data points.

I will use your inputs to prune the checkpoint files and reduce the size. Hopefully I can keep the size under control this way.