Error using checkpoint 0.6.1 while training on own data

I am resuming training on my own dataset from the 0.6.1 checkpoint. Setup: TensorFlow 1.14, Ubuntu 16.04, CUDA 10, CuDNN 7.5.

I have downloaded the released checkpoint and am using the following command:

CUDA_VISIBLE_DEVICES=2,3 python3 --n_hidden 2048 --checkpoint_dir checkcheck/deepspeech-0.6.1-checkpoint/ --epochs 3 --train_files extracted/language/archive/clips/train.csv --dev_files extracted/language/archive/clips/dev.csv --test_files extracted/language/archive/clips/dev.csv --learning_rate 0.0001 --use_cudnn_rnn true --use_allow_growth true

However, I am getting the following error:

Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from checkcheck/deepspeech-0.6.1-checkpoint/best_dev-233784
I0118 11:53:46.365865 140480301135616] Restoring parameters from checkcheck/deepspeech-0.6.1-checkpoint/best_dev-233784
E Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
E No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' used by node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at with these attrs: [dropout=0, seed=4568, num_params=8, input_mode="linear_input", T=DT_FLOAT, direction="unidirectional", rnn_mode="lstm", seed2=240]
E Registered devices: [CPU, XLA_CPU]
E Registered kernels:
E   device='GPU'; T in [DT_DOUBLE]
E   device='GPU'; T in [DT_FLOAT]
E   device='GPU'; T in [DT_HALF]
E        [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]
E **The checkpoint in checkcheck/deepspeech-0.6.1-checkpoint/best_dev-233784 does not match the shapes of the model. Did you change alphabet.txt or the --n_hidden parameter between train runs using the same checkpoint dir? Try moving or removing the contents of checkcheck/deepspeech-0.6.1-checkpoint/best_dev-233784.**

I have checked the alphabet file and it seems fine. Can someone tell me what I am missing?

Isn’t this error clear enough? Your CuDNN setup seems wrong: the log lists only CPU devices as registered, while the CudnnRNN op has GPU kernels only.
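
To spell out what to look for in the log: the line `Registered devices: [CPU, XLA_CPU]` means TensorFlow did not register any GPU, so an op that only has GPU kernels cannot be placed. Here is a minimal sketch that pulls that line out of a saved log. The function names are hypothetical and only used for illustration.

```python
import re


def registered_devices(log_text):
    """Return the device types from a 'Registered devices: [...]' log line."""
    match = re.search(r"Registered devices:\s*\[([^\]]*)\]", log_text)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def gpu_registered(log_text):
    """True if the log shows a GPU (or XLA_GPU) device type as registered."""
    return any(name in ("GPU", "XLA_GPU") for name in registered_devices(log_text))


# Line copied from the error output in the first post.
log = "E Registered devices: [CPU, XLA_CPU]"
print(registered_devices(log))
print(gpu_registered(log))
```

If `gpu_registered` is false for your log, the checkpoint is probably fine and the CUDA/CuDNN install is the thing to look at.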

@adesara.amit Also, upstream documents CuDNN 7.6 for CUDA 10.0.

@adesara.amit I faced the same issue with the following setup:

checkpoint version: 0.6.1
DeepSpeech version: 0.6.1
Tensorflow: 1.14.0-gpu
CUDA: 10.1
CuDNN: 7.6.5
Nvidia driver: 418.67

For me, using the flag --cudnn_checkpoint instead of --checkpoint_dir fixed it. I’m not sure if there is any performance loss.

Edit: Actually, there is a performance loss as one cannot use CuDNN in this case. This is a workaround but obviously not a fix.

I briefly encountered this and was able to solve it by making sure the right versions of TensorFlow, CUDA, and CuDNN are installed. Here’s what I have:

  1. python 3.6
  2. CUDA: 10.0
  3. CuDNN: 7.6.4
  4. Tensorflow: 1.14.0-gpu

I’m running in a conda env.
I believe CUDA 10.1 with CuDNN 7.6.5 is incorrect if you are using TF 1.14.0.
If you are on TF 1.15.0, then you can use CUDA 10.0 with CuDNN 7.6.5.
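
To compare your own install against version lists like the one above, here is a rough sketch that parses the CUDA release from `nvcc --version` output and the CuDNN version from the `cudnn.h` header. It assumes the output contains a `release X.Y` string and that the header defines `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL`; where the header lives depends on your install, and the function names are hypothetical.

```python
import re


def cuda_release(nvcc_output):
    """Parse (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


def cudnn_version(header_text):
    """Parse (major, minor, patch) from the contents of a cudnn.h header."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"#define\s+" + key + r"\s+(\d+)", header_text)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


# Sample inputs for illustration; paste your own nvcc output and header text.
nvcc_sample = "Cuda compilation tools, release 10.0, V10.0.130"
header_sample = (
    "#define CUDNN_MAJOR 7\n"
    "#define CUDNN_MINOR 6\n"
    "#define CUDNN_PATCHLEVEL 5\n"
)
print(cuda_release(nvcc_sample))
print(cudnn_version(header_sample))
```

Keep in mind that the toolkit `nvcc` reports is not necessarily the one TensorFlow loads at runtime, especially in a conda env, so treat this as a first check only.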

After installation, list the devices TensorFlow can see.

Make sure the device number matches what’s visible in your env var: if you have device '0', set os.environ['CUDA_VISIBLE_DEVICES'] = '0'.
In most cases you shouldn’t need to set this.
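
A minimal sketch of such a device check, assuming TensorFlow is importable in the active environment (the helper names are hypothetical, and the import is guarded so the script still runs without TensorFlow):

```python
import os


def visible_devices(environ=os.environ):
    """Return the CUDA_VISIBLE_DEVICES entries, or None if the variable is unset."""
    raw = environ.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: CUDA exposes every GPU on the machine
    return [dev.strip() for dev in raw.split(",") if dev.strip()]


def local_device_names():
    """Return the device names TensorFlow reports, or None if TF is missing."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return [dev.name for dev in device_lib.list_local_devices()]


print("CUDA_VISIBLE_DEVICES:", visible_devices())
print("TensorFlow devices:", local_device_names())
```

Note that CUDA renumbers the visible devices from 0, so with `CUDA_VISIBLE_DEVICES=2,3` TensorFlow should list them as GPU 0 and GPU 1. If the list contains only CPU entries, the GPU build is not finding its CUDA/CuDNN libraries.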

Also check that TensorFlow reports a GPU as available; the check should return True.
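
One way to run that check, as a sketch: `tf.test.is_built_with_cuda()` and `tf.test.is_gpu_available()` are TensorFlow calls, while the wrapper function and the dict keys are hypothetical. The import is guarded so the script reports `None` instead of crashing when TensorFlow is not installed.

```python
def gpu_status():
    """Collect TensorFlow's view of GPU support, or None if TF is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return {
        "version": tf.__version__,
        # False here means a CPU-only TensorFlow build is installed.
        "built_with_cuda": tf.test.is_built_with_cuda(),
        # False here with a CUDA build usually points at the driver/CUDA/CuDNN setup.
        "gpu_available": tf.test.is_gpu_available(),
    }


print(gpu_status())
```

Both `built_with_cuda` and `gpu_available` should be True before you retry training with --use_cudnn_rnn.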

Hope this helps.

Correction: my CuDNN version is 7.4.1 (not 7.6.4).

Hi Mohamed,

I am facing the same issue as you. I have the same CUDA/CuDNN settings as you, and if I specify --checkpoint_dir, it does not work. The --cudnn_checkpoint workaround temporarily fixes the issue, but the results aren’t great. I am running it on a Google Colab instance. Can you please let me know which CUDA versions you changed to in order to make it work? Much appreciated.