Retrain from 0.6.0 release checkpoint fails

@lissyx @reuben I tried to start training from the 0.6.0 release checkpoint with my own English dataset, but restoring the checkpoint fails with the error below. I tried the DeepSpeech code from the “master” branch as well as from the “v0.6.0” tag. Do you have any suggestions for resolving this?

INFO:tensorflow:Restoring parameters from /home/sranjeet/Documents/Speech_DataSet/0.6.0-ASR/train1_sva_ip_0.6.0/best_dev-233784
I1205 14:33:44.495084 140431932188480 saver.py:1280] Restoring parameters from /home/sranjeet/Documents/Speech_DataSet/0.6.0-ASR/train1_sva_ip_0.6.0/best_dev-233784
Traceback (most recent call last):
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 1356, in _do_call
return fn(*args)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key cond_1/beta1_power not found in checkpoint
[[{{node save_1/RestoreV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 1286, in restore
{self.saver_def.filename_tensor_name: save_path})
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 950, in run
run_metadata_ptr)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 1173, in _run
feed_dict_tensor, options, run_metadata)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 1350, in _do_run
run_metadata)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key cond_1/beta1_power not found in checkpoint
[[node save_1/RestoreV2 (defined at DeepSpeech.py:495) ]]

Original stack trace for ‘save_1/RestoreV2’:
File “DeepSpeech.py”, line 965, in <module>
absl.app.run(main)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py”, line 299, in run
_run_main(main, args)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py”, line 250, in _run_main
sys.exit(main(argv))
File “DeepSpeech.py”, line 938, in main
train()
File “DeepSpeech.py”, line 495, in train
best_dev_saver = tfv1.train.Saver(max_to_keep=1)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 825, in __init__
self.build()
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 837, in build
self._build(self._filename, build_save=True, build_restore=True)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 875, in _build
build_restore=build_restore)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 508, in _build_internal
restore_sequentially, reshape)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 328, in _AddRestoreOps
restore_sequentially)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 575, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py”, line 1696, in restore_v2
name=name)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py”, line 788, in _apply_op_helper
op_def=op_def)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py”, line 3616, in create_op
op_def=op_def)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py”, line 2005, in __init__
self._traceback = tf_stack.extract_stack()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 1296, in restore
names_to_keys = object_graph_key_mapping(save_path)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 1614, in object_graph_key_mapping
object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py”, line 678, in get_tensor
return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “DeepSpeech.py”, line 965, in <module>
absl.app.run(main)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py”, line 299, in run
_run_main(main, args)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py”, line 250, in _run_main
sys.exit(main(argv))
File “DeepSpeech.py”, line 938, in main
train()
File “DeepSpeech.py”, line 554, in train
loaded = try_loading(session, best_dev_saver, ‘best_dev_checkpoint’, ‘best validation’)
File “DeepSpeech.py”, line 403, in try_loading
saver.restore(session, checkpoint_path)
File “/home/sranjeet/DeepSpeech_0.6.0/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 1302, in restore
err, “a Variable name or other graph key that is missing”)
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:


You need to either use --use_cudnn_rnn or specify the v0.6.0 checkpoint dir with the --cudnn_checkpoint flag.
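As a rough illustration of the two alternatives in the reply above, here is a minimal Python sketch that builds the restore-related flags for each case. The function name `restore_flags` and the path `/path/to/deepspeech-0.6.0-checkpoint` are made up for this example; only the flag names come from this thread, and all other training flags are deliberately left out.

```python
def restore_flags(release_checkpoint_dir, use_cudnn_rnn):
    """Return the DeepSpeech.py flags for restoring the v0.6.0 release checkpoint.

    Hypothetical helper: only the restore-related flags named in this thread
    are produced. Data files, batch sizes, etc. still have to be added.
    """
    if use_cudnn_rnn:
        # Alternative 1: train with the CuDNN RNN graph and point
        # --checkpoint_dir at the release checkpoint.
        return [
            "--use_cudnn_rnn=True",
            "--checkpoint_dir=" + release_checkpoint_dir,
        ]
    # Alternative 2: no --use_cudnn_rnn; hand the release checkpoint
    # directory to --cudnn_checkpoint instead.
    return ["--cudnn_checkpoint=" + release_checkpoint_dir]


# Placeholder path, not a real location.
release_dir = "/path/to/deepspeech-0.6.0-checkpoint"
print(" ".join(["python", "DeepSpeech.py"] + restore_flags(release_dir, True)))
print(" ".join(["python", "DeepSpeech.py"] + restore_flags(release_dir, False)))
```

The helper returns one set of flags or the other, never both, since the reply presents them as alternatives.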

@reuben thanks. I am now able to retrain from the release checkpoint, but the intermediate checkpoints produced during training are smaller (541 MB) than the release checkpoint, which is about 700 MB. When I export such an intermediate or final checkpoint to a TFLite model, inference always crashes.

Below is a listing of the release checkpoint and of the checkpoint I generated during training.

669M Dec 5 17:51 best_dev-233784.data-00000-of-00001
1.5K Dec 5 17:46 best_dev-233784.index
8.3M Dec 5 17:47 best_dev-233784.meta
541M Dec 5 21:18 best_dev-234081.data-00000-of-00001
1.5K Dec 5 21:18 best_dev-234081.index
2.4M Dec 5 21:18 best_dev-234081.meta
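One possible way to see which variables account for the size difference is to list the contents of both checkpoints and compare them. The sketch below is only a diagnostic idea, not something taken from the DeepSpeech code: `list_checkpoint_variables` and `diff_variables` are hypothetical helper names, and the first one assumes TensorFlow is installed and uses `tf.train.list_variables`.

```python
def list_checkpoint_variables(checkpoint_prefix):
    """Return {variable_name: shape} for a checkpoint.

    Assumes TensorFlow is installed; it is imported lazily so the
    comparison helper below can be used without it.
    """
    import tensorflow as tf

    return {
        name: tuple(shape)
        for name, shape in tf.train.list_variables(checkpoint_prefix)
    }


def diff_variables(first, second):
    """Compare two {name: shape} mappings.

    Returns (names only in first, names only in second,
    names present in both but with different shapes), each sorted.
    """
    only_first = sorted(set(first) - set(second))
    only_second = sorted(set(second) - set(first))
    reshaped = sorted(
        name for name in set(first) & set(second) if first[name] != second[name]
    )
    return only_first, only_second, reshaped
```

For the listing above, this would mean calling `list_checkpoint_variables` once with the `best_dev-233784` prefix and once with the `best_dev-234081` prefix (full paths, without the `.data-00000-of-00001` suffix) and passing both results to `diff_variables`.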

It would be more useful to share the exact command lines needed to reproduce this. There’s nothing we can do blind like that.

I’m also getting this problem. After extracting deepspeech-0.6.0-checkpoint.tar.gz into /root/data/deepspeech/checkpoint, the following command produces the OP’s error.

python /DeepSpeech/DeepSpeech.py \
--checkpoint_dir=/root/data/deepspeech/checkpoint \
--train_files=/root/data/deepspeech/train-sb-clean-trimmed.csv \
--dev_files=/root/data/deepspeech/dev-sb-trimmed.csv \
--test_files=/root/data/deepspeech/test-sb-trimmed.csv \
--automatic_mixed_precision=True \
--use_cudnn_rnn=True \
--train_batch_size=32 \
--dev_batch_size=32 \
--test_batch_size=32 \
--export_batch_size=32 \
--export_dir=/root/data/deepspeech/export \
--export_language=sv \
--summary_dir=/root/data/deepspeech/tensorboard \
--lm_binary_path=/root/data/deepspeech/lm.bin \
--lm_trie_path=/root/data/deepspeech/trie \
--early_stop=False \
--epochs=200 \
--learning_rate=0.0001 \
--export_tflite=True \
--export_zip=True \
--log_dir=/root/data/deepspeech/logs

Adding --cudnn_checkpoint=True produces:

Trying to use --cudnn_checkpoint but --use_cudnn_rnn was specified. The --cudnn_checkpoint flag is only needed when converting a CuDNN RNN checkpoint to a CPU-capable graph. If your system is capable of using CuDNN RNN, you can just specify the CuDNN RNN checkpoint normally with --checkpoint_dir.

Removing --use_cudnn_rnn produces:

ValueError: Couldn't find 'checkpoint' file or checkpoints in given directory /root/data/deepspeech/checkpoint

Rename best_dev_checkpoint to checkpoint.
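For anyone who would rather script this step, here is a small Python sketch. It copies the file instead of renaming it so that the original `best_dev_checkpoint` stays in place; that is a choice made for this example, not what the reply above prescribes, and `expose_best_dev_checkpoint` is a made-up helper name.

```python
import os
import shutil


def expose_best_dev_checkpoint(checkpoint_dir):
    """Make best_dev_checkpoint available under the name 'checkpoint'.

    Copies rather than renames, and leaves an existing 'checkpoint'
    file untouched. Raises FileNotFoundError if best_dev_checkpoint
    is missing from checkpoint_dir.
    """
    source = os.path.join(checkpoint_dir, "best_dev_checkpoint")
    target = os.path.join(checkpoint_dir, "checkpoint")
    if os.path.exists(target):
        # Nothing to do: do not overwrite a file that is already there.
        return target
    shutil.copyfile(source, target)
    return target
```

With the layout from the post above, the call would be `expose_best_dev_checkpoint("/root/data/deepspeech/checkpoint")`.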


I was close but yet so far away. Thanks @lissyx