Error using checkpoint 0.6.1 while training on own data

I am resuming training on my own dataset using checkpoint 0.6.1. Setup: TensorFlow 1.14, Ubuntu 16.04, CUDA 10, CuDNN 7.5.

I have downloaded the released checkpoint and am using the following command:

CUDA_VISIBLE_DEVICES=2,3 python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir checkcheck/deepspeech-0.6.1-checkpoint/ --epochs 3 --train_files extracted/language/archive/clips/train.csv --dev_files extracted/language/archive/clips/dev.csv --test_files extracted/language/archive/clips/dev.csv --learning_rate 0.0001 --use_cudnn_rnn true --use_allow_growth true

However, I am getting the following error:

Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from checkcheck/deepspeech-0.6.1-checkpoint/best_dev-233784
I0118 11:53:46.365865 140480301135616 saver.py:1280] Restoring parameters from checkcheck/deepspeech-0.6.1-checkpoint/best_dev-233784
E Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
E
E No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' used by node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at DeepSpeech.py:119) with these attrs: [dropout=0, seed=4568, num_params=8, input_mode="linear_input", T=DT_FLOAT, direction="unidirectional", rnn_mode="lstm", seed2=240]
E Registered devices: [CPU, XLA_CPU]
E Registered kernels:
E   device='GPU'; T in [DT_DOUBLE]
E   device='GPU'; T in [DT_FLOAT]
E   device='GPU'; T in [DT_HALF]
E
E        [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]
E **The checkpoint in checkcheck/deepspeech-0.6.1-checkpoint/best_dev-233784 does not match the shapes of the model. Did you change alphabet.txt or the --n_hidden parameter between train runs using the same checkpoint dir? Try moving or removing the contents of checkcheck/deepspeech-0.6.1-checkpoint/best_dev-233784.**

I have checked the alphabet file and it seems fine. Can someone advise if I am missing something?


Isn’t this error clear enough? Your CuDNN setup seems wrong.

@adesara.amit Also, upstream documents CUDNN 7.6 for CUDA 10.0: https://www.tensorflow.org/install/gpu#ubuntu_1604_cuda_10

@adesara.amit I faced the same issue with the following setup:

checkpoint version: 0.6.1
DeepSpeech version: 0.6.1
Tensorflow: 1.14.0-gpu
CUDA: 10.1
CuDNN: 7.6.5
Nvidia driver: 418.67

For me, using the flag --cudnn_checkpoint instead of --checkpoint_dir fixed it. I’m not sure if there is any performance loss.

Edit: Actually, there is a performance loss as one cannot use CuDNN in this case. This is a workaround but obviously not a fix.

I briefly encountered this and was able to solve it by making sure the right versions of TensorFlow, CUDA, and CuDNN are installed. Here’s what I have:

  1. python 3.6
  2. CUDA: 10.0
  3. CuDNN: 7.6.4
  4. Tensorflow: 1.14.0-gpu

I’m running in a conda env.
I believe CUDA 10.1 with CuDNN 7.6.5 is incorrect if you are using TF 1.14.0.
If you are on TF 1.15.0, then you can use CUDA 10.0 with CuDNN 7.6.5.
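As a quick sanity check, the version pairings discussed in this thread can be put into a small lookup. This is purely illustrative and encodes only the combinations mentioned here; the authoritative compatibility matrix is the tested-build table on tensorflow.org.

```python
# Illustrative lookup of the TF-GPU / CUDA / cuDNN pairings discussed in this
# thread -- NOT an authoritative compatibility matrix.
TESTED_COMBOS = {
    "1.14.0": {"cuda": "10.0", "cudnn_prefixes": ("7.4", "7.6")},
    "1.15.0": {"cuda": "10.0", "cudnn_prefixes": ("7.4", "7.6")},
}

def looks_compatible(tf_version, cuda_version, cudnn_version):
    """Rough check that a CUDA/cuDNN pair matches a TF release."""
    combo = TESTED_COMBOS.get(tf_version)
    if combo is None:
        return False
    return (cuda_version == combo["cuda"]
            and cudnn_version.startswith(combo["cudnn_prefixes"]))

# The failing setup from earlier in the thread: CUDA 10.1 with TF 1.14.0.
print(looks_compatible("1.14.0", "10.1", "7.6.5"))  # False
print(looks_compatible("1.14.0", "10.0", "7.4.1"))  # True
```

This is just a mnemonic for the rule of thumb above: for the TF 1.14/1.15 builds, CUDA 10.0 is expected, and a mismatched CUDA (like 10.1 here) is the first thing to rule out.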

After installation, you can run
tf.test.is_gpu_available()

Make sure the device number matches what’s visible in your env var; if you have device '0', then set os.environ['CUDA_VISIBLE_DEVICES'] = '0'.
In most cases you shouldn’t need to set this.

Also check that the following returns True:
tf.test.is_built_with_cuda()
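The checks above can be bundled into one small script. This is a sketch using the TF 1.x test API mentioned in this thread; it degrades gracefully if TensorFlow isn't importable.

```python
import os

# Pin the visible GPU before TensorFlow is imported (here: device 0,
# matching the single-GPU example above).
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")

def cuda_report():
    """Collect the GPU sanity checks mentioned above into one dict."""
    try:
        import tensorflow as tf
    except ImportError:
        return {"tensorflow_installed": False}
    return {
        "tensorflow_installed": True,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        # tf.test.is_gpu_available() is the TF 1.14/1.15-era check used here;
        # it is deprecated in TF 2.x in favor of tf.config.list_physical_devices.
        "gpu_available": tf.test.is_gpu_available(),
    }

report = cuda_report()
print(report)
```

If built_with_cuda is False you installed a CPU-only wheel; if it is True but gpu_available is False, the CUDA/CuDNN runtime libraries are the likely mismatch.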

Hope this helps.


Correction: my CuDNN version is 7.4.1 (not 7.6.4).

Hi Mohamed,

I am facing the same issue as you. I have the same CUDA/CuDNN settings as you, and if I specify --checkpoint_dir, it’s not working. The --cudnn_checkpoint workaround temporarily fixes the issue, but the results aren’t great. I am running it on a Google Colab instance; can you please let me know which CUDA version specifications you changed to make it work? Much appreciated.

Best
Krishna