Error loading checkpoint (Key cond_1/beta1_power not found in checkpoint)

I started using DeepSpeech 0.7. As usual, to verify the environment was set up correctly, I ran

./bin/run-ldc93s1.sh

That script worked (and still works).

Now, to fine-tune on my own data, I want to start from the released DeepSpeech checkpoints, extracted from “deepspeech-0.7.0-checkpoint.tar.gz”.

However, when I start training with the following command (some parameters removed to keep this short):

python DeepSpeech.py --n_hidden 2048 --checkpoint_dir ../deepspeech-0.7.0-checkpoint --export_dir ../trained_model/ --epochs 2  --train_files my-train.csv --dev_files my-dev.csv --test_files my-test.csv --train_cudnn=True --automatic_mixed_precision=True 

I’m getting the following error:

tensorflow.python.framework.errors_impl.NotFoundError: Key cond_1/beta1_power not found in checkpoint

The full traceback is:

File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/home/sayantan/Desktop/ai_learning/deepspeech_0_7/DeepSpeech/training/deepspeech_training/train.py", line 939, in run_script
absl.app.run(main)
File "/home/sayantan/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/sayantan/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/sayantan/Desktop/ai_learning/deepspeech_0_7/DeepSpeech/training/deepspeech_training/train.py", line 911, in main
train()
File "/home/sayantan/Desktop/ai_learning/deepspeech_0_7/DeepSpeech/training/deepspeech_training/train.py", line 511, in train
load_or_init_graph_for_training(session)
File "/home/sayantan/Desktop/ai_learning/deepspeech_0_7/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 132, in load_or_init_graph_for_training
_load_or_init_impl(session, methods, allow_drop_layers=True)
File "/home/sayantan/Desktop/ai_learning/deepspeech_0_7/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 97, in _load_or_init_impl
return _load_checkpoint(session, ckpt_path, allow_drop_layers)
File "/home/sayantan/Desktop/ai_learning/deepspeech_0_7/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 70, in _load_checkpoint
v.load(ckpt.get_tensor(v.op.name), session=session)
File "/home/sayantan/.local/lib/python3.6/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 915, in get_tensor
return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
tensorflow.python.framework.errors_impl.NotFoundError: Key cond_1/beta1_power not found in checkpoint

Could you help me understand why this is happening? Is the checkpoint missing some variable?
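
For what it’s worth, the variable names stored in the checkpoint can be listed directly with TensorFlow’s checkpoint utilities (a quick sketch, assuming the same TensorFlow 1.x install that DeepSpeech uses; the path matches my command above):

python -c "import tensorflow as tf; print([name for name, _ in tf.train.list_variables('../deepspeech-0.7.0-checkpoint') if 'beta1_power' in name])"

If that prints only beta1_power (and not cond_1/beta1_power), then the key really is absent from the released checkpoint rather than something broken in my setup.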

Search the training docs for your error; you are missing the load_cudnn flag.

I had done that, but I shall look into the docs again. Really sorry if that’s the case.

train_cudnn is the flag to perform training. You also need the loading one, load_cudnn.

Makes sense. I also got another error in the middle of training, so I shall post it only after checking it more thoroughly.

Hi,
Regarding the same issue: isn’t the load_cudnn flag used to convert a CuDNN RNN checkpoint to run on CPU? When I use both flags, I get this message:

E Trying to use --train_cudnn, but --load_cudnn was also specified. The --load_cudnn flag is only needed when converting a CuDNN RNN checkpoint to a CPU-capable graph. If your system is capable of using CuDNN RNN, you can just specify the CuDNN RNN checkpoint normally with --save_checkpoint_dir.

PS: I have a GPU system. Also, if this is a bug, should I raise an issue?
Edit: I think the issue is with the automatic_mixed_precision flag.
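
If I read that error text correctly, load_cudnn is only for the opposite situation: continuing from a CuDNN-trained checkpoint on a machine without CuDNN support. A sketch of that case, reusing the paths from the first post (so not applicable to my GPU setup):

python DeepSpeech.py --n_hidden 2048 --checkpoint_dir ../deepspeech-0.7.0-checkpoint --load_cudnn --epochs 2 --train_files my-train.csv --dev_files my-dev.csv --test_files my-test.csv

So on a GPU system the earlier advice to add load_cudnn does not apply, which is why I suspect automatic_mixed_precision instead.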

I am seeing the same issue when fine-tuning with the automatic_mixed_precision flag on 0.7.0. I believe only the train_cudnn flag is needed, since I am working on a GPU, going by the flags listed by --helpfull.

python DeepSpeech.py --n_hidden 2048 --checkpoint_dir deepspeech-0.7.0-checkpoint --epochs 100 --train_files bin/voxforge/voxforge-train.csv --dev_files bin/voxforge/voxforge-dev.csv --learning_rate 0.000001 --scorer_path models/deepspeech-0.7.0-models.scorer --train_cudnn --use_allow_growth --train_batch_size 32 --dev_batch_size 32 --es_epochs 10 --early_stop True --automatic_mixed_precision

It fails with the error:

tensorflow.python.framework.errors_impl.NotFoundError: Key cond_1/beta1_power not found in checkpoint

Is the automatic_mixed_precision flag only supported for fresh training and not for fine-tuning? I did notice that the flag is not supported when fine-tuning from the release checkpoints.

thanks

I am the second one on this thread to state: “Reading the docs helps”

https://deepspeech.readthedocs.io/en/v0.7.1/TRAINING.html#fine-tuning-same-alphabet
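
The relevant part is the note on automatic_mixed_precision at the end of that section: a checkpoint can only be loaded with the same mixed-precision setting it was trained with, and the release checkpoints were apparently trained without it. My understanding (not verified in the TensorFlow source) is that enabling AMP wraps the optimizer step in loss-scaling logic inside a tf.cond, so Adam’s slot variables get created under a cond_1/ scope and the loader then asks the checkpoint for cond_1/beta1_power, which doesn’t exist there. Dropping the flag from the command above should be enough, e.g. (sketch, other flags unchanged):

python DeepSpeech.py --n_hidden 2048 --checkpoint_dir deepspeech-0.7.0-checkpoint --epochs 100 --train_files bin/voxforge/voxforge-train.csv --dev_files bin/voxforge/voxforge-dev.csv --learning_rate 0.000001 --scorer_path models/deepspeech-0.7.0-models.scorer --train_cudnn --use_allow_growth --train_batch_size 32 --dev_batch_size 32 --es_epochs 10 --early_stop True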

Thank you very much for pointing this out… I thought I had read this, but I missed the last portion on automatic_mixed_precision.

thanks again