Cannot train DeepSpeech on RTX 2070

ISSUE
Cannot train DeepSpeech on an RTX 2070. TensorFlow 1.13 does not appear to be compatible with this newer graphics card.

ERROR

Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

Failed to get convolution algorithm. This is probably because cuDNN failed to initialize

ACTIONS

  1. TensorFlow 1.13 was compiled and built from source; the issue persists.

  2. Added extra configuration in config.py at line 63:
    c.session_config.gpu_options.per_process_gpu_memory_fraction = 0.6
    c.session_config.gpu_options.allow_growth = True
    Neither setting resolved the issue (a minimal standalone check is sketched below).
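
For reference, the same two options can be exercised outside of DeepSpeech with a tiny TensorFlow 1.x convolution, to see whether cuDNN initializes at all on this card. This is only a minimal sketch, not DeepSpeech code; tf.ConfigProto and its gpu_options fields are the standard TF 1.x API:

    # cudnn_smoke_test.py - does cuDNN manage to run a single convolution?
    import numpy as np
    import tensorflow as tf

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True                    # allocate GPU memory on demand
    config.gpu_options.per_process_gpu_memory_fraction = 0.6  # optional hard cap

    x = tf.constant(np.random.rand(1, 16, 16, 1), dtype=tf.float32)
    w = tf.constant(np.random.rand(3, 3, 1, 1), dtype=tf.float32)
    y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')

    with tf.Session(config=config) as sess:
        print(sess.run(y).shape)  # raises CUDNN_STATUS_INTERNAL_ERROR if cuDNN cannot initialize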

INFO
Using the latest version of DeepSpeech (v0.5.0-alpha) with an NVIDIA RTX 2070, CUDA 10, cuDNN 7.5, and tensorflow-gpu 1.13.1.

LOG

root@953d2eb1cfea:/DeepSpeech-root/DeepSpeech# ./run-ldc93s1.sh 
+ [ ! -f DeepSpeech.py ]
+ [ ! -f data/ldc93s1/ldc93s1.csv ]
+ [ -d  ]
+ python -c from xdg import BaseDirectory as xdg; print(xdg.save_data_path("deepspeech/ldc93s1"))
+ checkpoint_dir=/root/.local/share/deepspeech/ldc93s1
+ export CUDA_VISIBLE_DEVICES=0
+ python -u DeepSpeech.py --train_files data/ldc93s1/ldc93s1.csv --test_files data/ldc93s1/ldc93s1.csv --train_batch_size 1 --test_batch_size 1 --n_hidden 100 --epochs 200 --log_level 0  
2019-05-14 11:58:21.309114: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-14 11:58:21.424454: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-14 11:58:21.425146: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x33890f0 executing computations on platform CUDA. Devices:
2019-05-14 11:58:21.425163: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2019-05-14 11:58:21.427067: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3491870000 Hz
2019-05-14 11:58:21.427566: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3c633e0 executing computations on platform Host. Devices:
2019-05-14 11:58:21.427587: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-05-14 11:58:21.427976: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:02:00.0
totalMemory: 7.76GiB freeMemory: 7.39GiB
2019-05-14 11:58:21.427994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-14 11:58:21.428764: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-14 11:58:21.428779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-05-14 11:58:21.428786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-05-14 11:58:21.429141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 7185 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:02:00.0, compute capability: 7.5)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use
    tf.py_function, which takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py:358: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/rnn/python/ops/lstm_ops.py:696: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
2019-05-14 11:58:22.280196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-14 11:58:22.280233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-14 11:58:22.280241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-05-14 11:58:22.280247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-05-14 11:58:22.280595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7185 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:02:00.0, compute capability: 7.5)
D Session opened.
I Initializing variables...
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000                                                                                                                                   2019-05-14 11:58:23.028141: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-05-14 11:58:24.275096: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-05-14 11:58:24.289759: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node tower_0/conv1d/Conv2D}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 829, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "DeepSpeech.py", line 813, in main
    train()
  File "DeepSpeech.py", line 510, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "DeepSpeech.py", line 483, in run_set
    feed_dict=feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]

Caused by op 'tower_0/conv1d/Conv2D', defined at:
  File "DeepSpeech.py", line 829, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "DeepSpeech.py", line 813, in main
    train()
  File "DeepSpeech.py", line 400, in train
    gradients, loss = get_tower_results(iterator, optimizer, dropout_rates)
  File "DeepSpeech.py", line 253, in get_tower_results
    avg_loss = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "DeepSpeech.py", line 186, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse)
  File "DeepSpeech.py", line 119, in create_model
    batch_x = create_overlapping_windows(batch_x)
  File "DeepSpeech.py", line 56, in create_overlapping_windows
    batch_x = tf.nn.conv1d(batch_x, eye_filter, stride=1, padding='SAME')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 574, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 574, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 3482, in conv1d
    data_format=data_format)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 1026, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]

root@953d2eb1cfea:/DeepSpeech-root/DeepSpeech#    

Can this issue be resolved? Any help is appreciated. Thanks.

It looks like your cuDNN setup may be incorrect. This is not really a DeepSpeech issue; it works well here on an RTX 2080 Ti.

Okay, I will verify. Thanks.

That is a problem with your CUDA version; I had the same problem when I upgraded to v0.5 with TF 1.13. Are you sure that you are using CUDA 10.0 and not CUDA 10.1?
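
One way to double-check which CUDA runtime and cuDNN versions actually get loaded is to query the libraries directly. A minimal sketch, assuming libcudart.so.10.0 and libcudnn.so.7 are on the loader path (cudaRuntimeGetVersion and cudnnGetVersion are part of the public CUDA/cuDNN C API):

    # print the CUDA runtime and cuDNN versions seen by the dynamic loader
    import ctypes

    cudart = ctypes.CDLL("libcudart.so.10.0")        # soname of the CUDA 10.0 runtime
    ver = ctypes.c_int()
    cudart.cudaRuntimeGetVersion(ctypes.byref(ver))
    print("CUDA runtime:", ver.value)                # 10000 = 10.0, 10010 = 10.1

    cudnn = ctypes.CDLL("libcudnn.so.7")
    cudnn.cudnnGetVersion.restype = ctypes.c_size_t
    print("cuDNN:", cudnn.cudnnGetVersion())         # e.g. 7501 = 7.5.1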

I have exactly the same problem…
I have CUDA 10.0 and cuDNN 7.5.1 installed, and the GPU seems to run out of memory.

Any updates or solutions?

No updates or solutions. I verified a proper setup of CUDA 10 and cuDNN 7.5. I can train with other cards.

If you are experiencing OOM on the GPU, please reduce the batch size.
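For DeepSpeech that means lowering the --train_batch_size and --test_batch_size flags passed to DeepSpeech.py.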

What do you mean by "can train with other cards"?

Anyway, the reported error does not come from DeepSpeech code; at best it would be an upstream TensorFlow issue, and there is nothing we can help with here.

The error explicitly states a failure to initialize cuDNN.