Fine tuning Deepspeech 0.9.1 with same alphabet

Ghada_Mjanah · November 27, 2020, 10:00am

Mozilla STT version: Deepspeech 0.9.1
OS: Linux 18.04
Python: 3.6.5
Tensorflow-gpu version: 1.15.4
GPU: NVIDIA GeForce MX230
CUDA version:

(env) ghada@ghada-Inspiron-3593:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

CUDNN version: cudnn-10.1-linux-x64-v7.6.5.32

Hello, I’ve been looking into DeepSpeech (DS) for a while.

I’ve installed DeepSpeech with pip.
pip3 install deepspeech
and downloaded both the pre-trained model and the scorer from the latest release (v0.9.1).
I ran inference with
deepspeech --model /my/path/to/deepspeech-0.9.1-models.pbmm --scorer /my/path/to/deepspeech-0.9.1-models.scorer --audio /my/path/to/myaudio.wav
and got the following results:

2020-11-27 03:42:35.544279: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Loading model from file /home/ghada/deepspeech-0.9.1-models.pbmm
TensorFlow: v2.3.0-6-g23ad988
DeepSpeech: v0.9.1-0-gab8bd3e
2020-11-27 03:42:35.646252: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-27 03:42:35.647121: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-27 03:42:35.669893: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-27 03:42:35.670202: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce MX230 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 2 deviceMemorySize: 1.96GiB deviceMemoryBandwidth: 44.76GiB/s
2020-11-27 03:42:35.670292: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-11-27 03:42:35.673877: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-11-27 03:42:35.674952: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-11-27 03:42:35.675210: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-11-27 03:42:35.676835: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-11-27 03:42:35.677750: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-11-27 03:42:35.681210: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-11-27 03:42:35.681316: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-27 03:42:35.681637: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-27 03:42:35.681894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-11-27 03:42:35.941946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-27 03:42:35.941990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
2020-11-27 03:42:35.941993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 
2020-11-27 03:42:35.942156: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-27 03:42:35.942463: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-27 03:42:35.942728: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-27 03:42:35.943037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1259 MB memory) -> physical GPU (device: 0, name: GeForce MX230, pci bus id: 0000:01:00.0, compute capability: 6.1)
Loaded model in 0.321s.
Loading scorer from files /home/ghada/deepspeech-0.9.1-models.scorer
Loaded scorer in 0.000138s.
Running inference.
2020-11-27 03:42:35.998799: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
karen read was on them by the greatness of your arm there is still as a stone and talouel as overlord until the people pass over who you have purchased
Inference took 4.402s for 9.189s audio file.

The goal of this work is to train the DeepSpeech model on Bible verses so that it becomes familiar with Bible vocabulary.
I have tested DS with several .wav files and got a few mistakes in the transcripts, which I’m trying to avoid by training the pretrained model on 73 hours of Bible audio (31080 files).

I will try to document every step so the experts can point out my mistakes.
I followed the steps in the documentation:

git clone --branch v0.9.1 https://github.com/mozilla/DeepSpeech
python3 -m venv $HOME/tmp/env/
source $HOME/tmp/env/bin/activate

cd DeepSpeech
pip3 install --upgrade pip==20.2.2 wheel==0.34.2 setuptools==49.6.0
pip3 install --upgrade -e .
sudo apt-get install python3-dev

I also followed these steps to get CUDA and CUDNN in the required versions.

I skipped the Dockerfile part.
I downloaded the checkpoint, the pre-trained model and the scorer from the latest release (v0.9.1).
I prepared the data:
- I converted the .wav files to int16, 16000 Hz sample rate, mono channel.
- I split my corpus in a 7:2:1 ratio for train:dev:test respectively.
- My CSV files contain the wav_filename,wav_filesize,transcript columns.
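
For anyone reproducing the data preparation, here is a minimal sketch of the 7:2:1 split described above. It assumes the rows are already available as dicts with the three CSV columns; the helper names split_rows and write_csv and the fixed seed are only illustrative, not part of DeepSpeech.

```python
import csv
import random

# Column names taken from the CSV layout described in the post.
FIELDNAMES = ["wav_filename", "wav_filesize", "transcript"]


def split_rows(rows, ratios=(7, 2, 1), seed=0):
    """Shuffle rows and split them into train/dev/test by the given ratios."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed so the split is repeatable
    total = sum(ratios)
    n_train = len(rows) * ratios[0] // total
    n_dev = len(rows) * ratios[1] // total
    train = rows[:n_train]
    dev = rows[n_train:n_train + n_dev]
    test = rows[n_train + n_dev:]  # whatever is left goes to test
    return train, dev, test


def write_csv(path, rows):
    """Write rows (dicts) to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
```

Calling split_rows on the full list and then write_csv once per subset would produce the train, dev and test files passed to --train_files, --dev_files and --test_files.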

Finally, I used the following command to train the model:

python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir /home/ghada/deepspeech-0.9.1-checkpoint/ --epochs 3 --train_cudnn --train_files train/train.csv --dev_files dev/dev.csv --test_files test/test.csv --scorer /home/ghada/deepspeech-0.9.1-models.scorer --learning_rate 0.0001 --export_dir output/ --export_tflite

I’ve set epochs to only 3 because I first want to train the model on just a few files to estimate how long training takes; later I will set it to somewhere between 10 and 20 epochs (please correct me if I’m wrong).
Running this command gave me the following error:

I Loading variable from checkpoint: beta1_power
Traceback (most recent call last):
  File "/home/ghada/anaconda3/envs/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/ghada/anaconda3/envs/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1348, in _run_fn
    self._extend_graph()
  File "/home/ghada/anaconda3/envs/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1388, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' used by {{node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams}}with these attrs: [seed=4568, dropout=0, num_params=8, input_mode="linear_input", T=DT_FLOAT, direction="unidirectional", rnn_mode="lstm", seed2=247]
Registered devices: [CPU, XLA_CPU, XLA_GPU]
Registered kernels:
  device='GPU'; T in [DT_DOUBLE]
  device='GPU'; T in [DT_FLOAT]
  device='GPU'; T in [DT_HALF]

	 [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/home/ghada/DeepSpeech/deepspeech_training/train.py", line 976, in run_script
    absl.app.run(main)
  File "/home/ghada/anaconda3/envs/env/lib/python3.6/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/ghada/anaconda3/envs/env/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/ghada/DeepSpeech/deepspeech_training/train.py", line 948, in main
    train()
  File "/home/ghada/DeepSpeech/deepspeech_training/train.py", line 527, in train
    load_or_init_graph_for_training(session)
  File "/home/ghada/DeepSpeech/deepspeech_training/util/checkpoints.py", line 137, in load_or_init_graph_for_training
    _load_or_init_impl(session, methods, allow_drop_layers=True)
  File "/home/ghada/DeepSpeech/deepspeech_training/util/checkpoints.py", line 98, in _load_or_init_impl
    return _load_checkpoint(session, ckpt_path, allow_drop_layers, allow_lr_init=allow_lr_init)
  File "/home/ghada/DeepSpeech/deepspeech_training/util/checkpoints.py", line 71, in _load_checkpoint
    v.load(ckpt.get_tensor(v.op.name), session=session)
  File "/home/ghada/anaconda3/envs/env/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/home/ghada/anaconda3/envs/env/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py", line 1033, in load
    session.run(self.initializer, {self.initializer.inputs[1]: value})
  File "/home/ghada/anaconda3/envs/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/ghada/anaconda3/envs/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ghada/anaconda3/envs/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/ghada/anaconda3/envs/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' used by node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at /home/ghada/anaconda3/envs/env/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [seed=4568, dropout=0, num_params=8, input_mode="linear_input", T=DT_FLOAT, direction="unidirectional", rnn_mode="lstm", seed2=247]
Registered devices: [CPU, XLA_CPU, XLA_GPU]
Registered kernels:
  device='GPU'; T in [DT_DOUBLE]
  device='GPU'; T in [DT_FLOAT]
  device='GPU'; T in [DT_HALF]
 [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]

I searched around for a similar error, and it turned out to be caused by wrong CUDA and CUDNN versions (for other DeepSpeech releases). Yet I have the required versions as mentioned above, and I have uninstalled and reinstalled them several times.
When I check whether the GPU is available to TensorFlow using

import tensorflow as tf; tf.test.is_gpu_available()

I get the following:

2020-11-27 04:33:19.587097: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1497600000 Hz
2020-11-27 04:33:19.587869: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560309ea6170 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-11-27 04:33:19.587921: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-11-27 04:33:19.591450: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-11-27 04:33:19.648328: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-27 04:33:19.648744: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560309f3f2a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-11-27 04:33:19.648757: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce MX230, Compute Capability 6.1
2020-11-27 04:33:19.648926: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-27 04:33:19.649194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: 
name: GeForce MX230 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:01:00.0
2020-11-27 04:33:19.649328: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda/lib64:/usr/local/cuda-10.1/lib64
2020-11-27 04:33:19.649394: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda/lib64:/usr/local/cuda-10.1/lib64
2020-11-27 04:33:19.649485: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda/lib64:/usr/local/cuda-10.1/lib64
2020-11-27 04:33:19.649579: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda/lib64:/usr/local/cuda-10.1/lib64
2020-11-27 04:33:19.649683: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda/lib64:/usr/local/cuda-10.1/lib64
2020-11-27 04:33:19.649745: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda/lib64:/usr/local/cuda-10.1/lib64
2020-11-27 04:33:19.652300: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-11-27 04:33:19.652318: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1662] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-11-27 04:33:19.652338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-27 04:33:19.652354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0 
2020-11-27 04:33:19.652364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N 
False

If I run the same DeepSpeech command to train on the CPU (replacing the --train_cudnn flag with --load_cudnn), it works perfectly, but it takes far too long, which is why I want to train on the GPU.

Did I miss a step somewhere?

lissyx · November 27, 2020, 11:02am

Wrong: training needs CUDA 10.0 + CuDNN v7.6, as documented for TensorFlow r1.15.
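
The TensorFlow log pasted above already names what is missing: each "Could not load dynamic library" warning asks for a .so.10.0 file. A minimal sketch, assuming log lines in exactly that format (the function names are made up for illustration), that pulls the library names and the version suffix out of a pasted log:

```python
import re

# Matches the dso_loader warning format seen in the pasted log.
_MISSING = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the library names reported as not loadable, in order, without duplicates."""
    seen = []
    for name in _MISSING.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


def versions_wanted(names):
    """Return the sorted set of version suffixes, e.g. 'libcudart.so.10.0' -> '10.0'."""
    return sorted({n.split(".so.", 1)[1] for n in names if ".so." in n})
```

If every missing name ends in the same suffix, that suffix is the CUDA version this TensorFlow build was linked against.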

lissyx · November 27, 2020, 11:03am

Wrong, please avoid using Conda and stick to vanilla Python with a deepspeech-dedicated virtualenv.

lissyx · November 27, 2020, 11:42am

@Ghada_Mjanah Thanks for taking the time to point out where our docs were misleading: in the past, the training and inference TensorFlow versions were the same, so the docs linked from training to usage for the CUDA dependencies. This is not true anymore. In addition, the official TensorFlow docs are missing r1.15 … so we link to their Dockerhub image.

Ghada_Mjanah · November 27, 2020, 11:48am

I will change the version right away, thanks !

Ghada_Mjanah · November 27, 2020, 11:49am

Thanks for pointing that out! I didn’t pay attention; I will not use conda.

Ghada_Mjanah · November 27, 2020, 11:50am

You are most welcome

Ghada_Mjanah · November 30, 2020, 8:27am

@lissyx I’m now using CUDA 10.0.130 and CUDNN 7.6.0, and this fixed the GPU problem (thanks again!). The GPU is now available to TensorFlow, but I get a new error:

Traceback (most recent call last):
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)

  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node tower_0/conv1d}}]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node tower_0/conv1d}}]]
	 [[tower_0/gradients/tower_0/BiasAdd_2_grad/BiasAddGrad/_87]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/home/ghada/DeepSpeech/deepspeech_training/train.py", line 976, in run_script
    absl.app.run(main)
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/ghada/DeepSpeech/deepspeech_training/train.py", line 948, in main
    train()
  File "/home/ghada/DeepSpeech/deepspeech_training/train.py", line 605, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "/home/ghada/DeepSpeech/deepspeech_training/train.py", line 570, in run_set
    feed_dict=feed_dict)
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node tower_0/conv1d (defined at /home/ghada/python-environments/env/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node tower_0/conv1d (defined at /home/ghada/python-environments/env/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[tower_0/gradients/tower_0/BiasAdd_2_grad/BiasAddGrad/_87]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'tower_0/conv1d':
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/home/ghada/DeepSpeech/deepspeech_training/train.py", line 976, in run_script
    absl.app.run(main)
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/ghada/DeepSpeech/deepspeech_training/train.py", line 948, in main
    train()
  File "/home/ghada/DeepSpeech/deepspeech_training/train.py", line 483, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/home/ghada/DeepSpeech/deepspeech_training/train.py", line 316, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "/home/ghada/DeepSpeech/deepspeech_training/train.py", line 243, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
  File "/home/ghada/DeepSpeech/deepspeech_training/train.py", line 171, in create_model
    batch_x = create_overlapping_windows(batch_x)
  File "/home/ghada/DeepSpeech/deepspeech_training/train.py", line 69, in create_overlapping_windows
    batch_x = tf.nn.conv1d(input=batch_x, filters=eye_filter, stride=1, padding='SAME')
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 574, in new_func
    return func(*args, **kwargs)
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 574, in new_func
    return func(*args, **kwargs)
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/tensorflow_core/python/ops/nn_ops.py", line 1681, in conv1d
    name=name)
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 1071, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/ghada/python-environments/env/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

I’m using vanilla Python as suggested (Python version 3.6.9).
When I searched for this issue, the results said it’s a CUDNN incompatibility. I changed the CUDNN version to 7.6.5 but the problem persists. Can you please tell me where I’m going wrong?

lissyx · November 30, 2020, 8:45am

I’m sorry, I can’t tell you what is wrong, this is a TensorFlow / CUDNN error, and I have no idea what is going on. Try running with more verbose logging, maybe.

lissyx · November 30, 2020, 9:50am

Make sure you don’t have any other process using the GPU. I just got the same error here, with GNOME 3 running …

Ghada_Mjanah · November 30, 2020, 9:59am

More verbose logging didn’t give any additional info.

Ghada_Mjanah · November 30, 2020, 10:00am

I restarted my PC and nvidia-smi reports 0% GPU utilization.

othiele · November 30, 2020, 10:02am

Sometimes it helps to just start over because you fiddled with all the libs too much. Why don’t you try it in a fresh environment? That helped me in the past.

lissyx · November 30, 2020, 10:07am

It’s not about GPU utilization, it’s about the GPU being locked / GPU memory being allocated, even a tiny portion.

Ghada_Mjanah · November 30, 2020, 10:08am

I will try in a new env right away !

lissyx · November 30, 2020, 10:09am

@Ghada_Mjanah Like I said, I just ran into a similar stack, and stopping gdm3 / gnome3 fixed it. Those errors can have a lot of root causes, and debugging CUDNN is hard, and not in our scope, sorry.

Ghada_Mjanah · November 30, 2020, 11:42am

nvidia-smi shows that these processes are running over GPU:

                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       980      G   /usr/lib/xorg/Xorg                 47MiB |
|    0   N/A  N/A      1065      G   /usr/bin/gnome-shell               46MiB |
|    0   N/A  N/A      1375      G   /usr/lib/xorg/Xorg                161MiB |
|    0   N/A  N/A      1561      G   /usr/bin/gnome-shell               39MiB |
+-----------------------------------------------------------------------------+

I can’t stop either of them, I don’t think that’s the problem for me …
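
For reference, the memory already held by those processes can be totalled from the table itself. This is a minimal sketch that assumes rows in exactly the layout pasted above (process names without spaces, memory in MiB); gpu_processes is a hypothetical helper, not an nvidia-smi API.

```python
import re

# One process row: | GPU  GI  CI  PID  Type  Process name  <N>MiB |
_ROW = re.compile(
    r"^\|\s+(\d+)\s+\S+\s+\S+\s+(\d+)\s+(\S+)\s+(\S+)\s+(\d+)MiB\s+\|$"
)


def gpu_processes(table_text):
    """Parse the process rows of an nvidia-smi table into a list of dicts."""
    procs = []
    for line in table_text.splitlines():
        match = _ROW.match(line.strip())
        if match:  # header and border lines do not match and are skipped
            procs.append({
                "gpu": int(match.group(1)),
                "pid": int(match.group(2)),
                "type": match.group(3),
                "name": match.group(4),
                "mib": int(match.group(5)),
            })
    return procs


def total_mib(procs):
    """Sum the GPU memory (MiB) used by the parsed processes."""
    return sum(p["mib"] for p in procs)
```

Applied to the four rows above, this gives the amount of GPU memory that Xorg and gnome-shell hold before training starts.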

Ghada_Mjanah · November 30, 2020, 11:43am

I created a whole new environment and rebooted, same issue …

lissyx · November 30, 2020, 12:35pm

Believe whatever you want, I can just tell you I had exactly this issue earlier today and that killing GNOME3 by sudo systemctl stop gdm3.service helped. If you are not willing to try our suggestions, we can’t help you.

Ghada_Mjanah · November 30, 2020, 1:18pm

@lissyx I’m sorry if I gave the impression that I’m not willing to try your suggestion; I meant that it didn’t work in my case… I’m still getting this error:

tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node tower_0/conv1d (defined at /home/ghada/python-environments/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[Mean/_61]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node tower_0/conv1d (defined at /home/ghada/python-environments/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

I switched to CUDNN 7.6.2 and a new environment, and I still get the same issue.
