Failing to start the training with 0.7.0

I followed the documentation step by step to build the environment.
After starting the training, I get the following error:

STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Traceback (most recent call last):
  File "/home/ubuntu/ds/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/ubuntu/ds/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/ubuntu/ds/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node tower_0/conv1d}}]]
	 [[tower_0/gradients/tower_0/BiasAdd_3_grad/BiasAddGrad/_131]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node tower_0/conv1d}}]]
0 successful operations.
1 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "./DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/home/ubuntu/ds/lib/python3.6/site-packages/DeepSpeech/training/deepspeech_training/train.py", line 939, in run_script
    absl.app.run(main)
  File "/home/ubuntu/ds/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/ubuntu/ds/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/ubuntu/ds/lib/python3.6/site-packages/DeepSpeech/training/deepspeech_training/train.py", line 911, in main
    train()
  File "/home/ubuntu/ds/lib/python3.6/site-packages/DeepSpeech/training/deepspeech_training/train.py", line 589, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "/home/ubuntu/ds/lib/python3.6/site-packages/DeepSpeech/training/deepspeech_training/train.py", line 549, in run_set
    feed_dict=feed_dict)
  File "/home/ubuntu/ds/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/ubuntu/ds/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ubuntu/ds/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/ubuntu/ds/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node tower_0/conv1d (defined at /home/ubuntu/ds/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[tower_0/gradients/tower_0/BiasAdd_3_grad/BiasAddGrad/_131]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node tower_0/conv1d (defined at /home/ubuntu/ds/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
1 derived errors ignored.
Original stack trace for 'tower_0/conv1d':

I tried manually setting TF_FORCE_GPU_ALLOW_GROWTH to true.
I also tried adding os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true' to the top of DeepSpeech.py (importing os first, of course).
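
A minimal sketch of that second attempt, shown on its own. It assumes the variable only needs to be in the process environment before TensorFlow sets up its GPU devices, so putting the assignment ahead of the TensorFlow import is the safe ordering; the rest of DeepSpeech.py is not reproduced here.

```python
import os

# Set the flag before TensorFlow is imported, so it is already in the
# process environment when the GPU devices are created.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# Quick confirmation that the process environment carries the value.
print("TF_FORCE_GPU_ALLOW_GROWTH =", os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])
```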

Did I miss something during the process?

Steps, after cloning DeepSpeech and Mozilla's TensorFlow:

  • pip install --upgrade pip==20.0.2 wheel==0.34.2 setuptools==46.1.3
  • pip install --upgrade --force-reinstall -e .
  • pip uninstall tensorflow -y
  • pip install 'tensorflow-gpu==1.15.2'
  • python3 generate_lm.py --input_txt …/vocabulary.txt --output_dir output/ --top_k 5000000 --kenlm_bins …/…/…/build/bin/ --arpa_order 3 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
  • python3 generate_package.py --alphabet …/alphabet.txt --lm output/lm.binary --vocab output/vocab-5000000.txt --package kenlm.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284

You already tried the GPU memory growth setting. The only other cause I’ve seen reported for this error is an incorrect version of CUDA or CuDNN. Make sure you’re on CUDA 10.0 and CuDNN 7.6.2.
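
One way to check the installed versions from Python is sketched below. It assumes a Linux install where the CUDA 10.0 runtime and cuDNN 7 are exposed under the sonames libcudart.so.10.0 and libcudnn.so.7, and that cuDNN 7 encodes its version number as major*1000 + minor*100 + patch. The helper names are made up for illustration.

```python
import ctypes


def load_or_none(soname):
    """Return a handle to the shared library, or None if it cannot be loaded."""
    try:
        return ctypes.CDLL(soname)
    except OSError:
        return None


def decode_cudnn7_version(number):
    """Split a cuDNN 7 version number such as 7602 into (major, minor, patch)."""
    major, rest = divmod(number, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


# Assumed soname for the CUDA 10.0 runtime.
cudart = load_or_none("libcudart.so.10.0")
print("CUDA 10.0 runtime:", "found" if cudart else "not found on the loader path")

# Assumed soname for cuDNN 7.
cudnn = load_or_none("libcudnn.so.7")
if cudnn is None:
    print("cuDNN 7: not found on the loader path")
else:
    # cudnnGetVersion() takes no arguments and returns a size_t.
    cudnn.cudnnGetVersion.restype = ctypes.c_size_t
    print("cuDNN version:", decode_cudnn7_version(cudnn.cudnnGetVersion()))
```

If either library is reported as not found, the loader path (for example LD_LIBRARY_PATH) is the first thing to look at before reinstalling anything.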

Hello!
I’m facing the same error while trying to fine-tune a checkpoint on the Callhome dataset.

I’m using this command to run training:

python DeepSpeech.py --n_hidden 2048 --checkpoint_dir deepspeech-0.7.0-checkpoint/ --epochs 3 --train_files train_full.csv --dev_files val_full.csv --test_files test_full.csv --learning_rate 0.0001 --scorer_path deepspeech-0.7.0-models.scorer --train_cudnn --use_allow_growth

My dependencies:
CUDA 10.0, CuDNN 7.6.2, Ubuntu 18.04, tensorflow-gpu 1.15.2

Error text:

Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Traceback (most recent call last):
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node tower_0/conv1d}}]]
	 [[tower_0/gradients/tower_0/MatMul_4_grad/tuple/control_dependency_1/_113]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node tower_0/conv1d}}]]
0 successful operations.
1 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 13, in <module>
    ds_train.run_script()
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/training/deepspeech_training/train.py", line 945, in run_script
    absl.app.run(main)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/training/deepspeech_training/train.py", line 917, in main
    train()
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/training/deepspeech_training/train.py", line 594, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/training/deepspeech_training/train.py", line 554, in run_set
    feed_dict=feed_dict)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node tower_0/conv1d (defined at /home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[tower_0/gradients/tower_0/MatMul_4_grad/tuple/control_dependency_1/_113]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node tower_0/conv1d (defined at /home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
1 derived errors ignored.

Original stack trace for 'tower_0/conv1d':
  File "DeepSpeech.py", line 13, in <module>
    ds_train.run_script()
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/training/deepspeech_training/train.py", line 945, in run_script
    absl.app.run(main)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/training/deepspeech_training/train.py", line 917, in main
    train()
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/training/deepspeech_training/train.py", line 480, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/training/deepspeech_training/train.py", line 318, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/training/deepspeech_training/train.py", line 245, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/training/deepspeech_training/train.py", line 173, in create_model
    batch_x = create_overlapping_windows(batch_x)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/training/deepspeech_training/train.py", line 75, in create_overlapping_windows
    batch_x = tf.nn.conv1d(input=batch_x, filters=eye_filter, stride=1, padding='SAME')
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 574, in new_func
    return func(*args, **kwargs)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 574, in new_func
    return func(*args, **kwargs)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/tensorflow_core/python/ops/nn_ops.py", line 1681, in conv1d
    name=name)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 1071, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/dash/projects/DeepSpeech/DeepSpeech-0.7.0/env/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

Are these (CUDA 10.0 and CuDNN 7.6.2) still the same for DeepSpeech 0.9.1?

Update: I found this in the documentation for v0.9.1:

The GPU capable builds (Python, NodeJS, C++, etc) depend on CUDA 10.1 and CuDNN v7.6.

Are all CuDNN v7.6.* releases acceptable?

Solved.

Use a conda environment and it will handle all the CUDA dependencies.

Instead of venv, install conda and create the virtual environment with conda (note that the environment you create for DeepSpeech 0.9.1 needs Python 3.6). Then follow the steps in the documentation.

Then uninstall tensorflow and install tensorflow-gpu=1.15 with conda:

conda install tensorflow-gpu=1.15

Done. Enjoy training.
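
To confirm that the new environment actually sees the GPU before starting a long run, something like the following sketch may help. The function name is hypothetical, and the check only reports whether TensorFlow was built with CUDA and can see a GPU device; it does not exercise cuDNN, so the convolution error above could in principle still appear during training.

```python
import importlib.util


def tensorflow_gpu_status():
    """Describe whether TensorFlow is installed and whether it sees a GPU."""
    if importlib.util.find_spec("tensorflow") is None:
        return "tensorflow is not installed in this environment"

    import tensorflow as tf

    # Both calls exist in TensorFlow 1.15; is_gpu_available() is deprecated
    # in later releases.
    built_with_cuda = tf.test.is_built_with_cuda()
    gpu_available = tf.test.is_gpu_available()
    return "version=%s built_with_cuda=%s gpu_available=%s" % (
        tf.__version__, built_with_cuda, gpu_available)


print(tensorflow_gpu_status())
```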

I will reply if it’s successful, and will thank you wholeheartedly.