Cannot train DeepSpeech on RTX 2070

ISSUE
Cannot train DeepSpeech on an RTX 2070. TensorFlow 1.13 does not appear to be compatible with this newer graphics card.

ERROR

Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

Failed to get convolution algorithm. This is probably because cuDNN failed to initialize

ACTIONS

  1. TensorFlow 1.13 was compiled and built from source; the issue persists.

  2. Added extra configuration in config.py at line 63:
    c.session_config.gpu_options.per_process_gpu_memory_fraction = 0.6
    c.session_config.gpu_options.allow_growth = True
    Neither setting resolved the issue (a minimal standalone check is sketched below).
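
For reference, the same two options can be exercised outside of DeepSpeech with a tiny TensorFlow 1.x convolution, to see whether cuDNN initializes at all on this card. This is only a minimal sketch, not DeepSpeech code; tf.ConfigProto and its gpu_options fields are the standard TF 1.x API:

    # cudnn_smoke_test.py - does cuDNN manage to run a single convolution?
    import numpy as np
    import tensorflow as tf

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True                    # allocate GPU memory on demand
    config.gpu_options.per_process_gpu_memory_fraction = 0.6  # optional hard cap

    x = tf.constant(np.random.rand(1, 16, 16, 1), dtype=tf.float32)
    w = tf.constant(np.random.rand(3, 3, 1, 1), dtype=tf.float32)
    y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')

    with tf.Session(config=config) as sess:
        print(sess.run(y).shape)  # raises CUDNN_STATUS_INTERNAL_ERROR if cuDNN cannot initialize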

INFO
Using the latest version of DeepSpeech (v0.5.0-alpha) with an NVIDIA RTX 2070, CUDA 10, cuDNN 7.5, and tensorflow-gpu 1.13.1.

LOG

root@953d2eb1cfea:/DeepSpeech-root/DeepSpeech# ./run-ldc93s1.sh 
+ [ ! -f DeepSpeech.py ]
+ [ ! -f data/ldc93s1/ldc93s1.csv ]
+ [ -d  ]
+ python -c from xdg import BaseDirectory as xdg; print(xdg.save_data_path("deepspeech/ldc93s1"))
+ checkpoint_dir=/root/.local/share/deepspeech/ldc93s1
+ export CUDA_VISIBLE_DEVICES=0
+ python -u DeepSpeech.py --train_files data/ldc93s1/ldc93s1.csv --test_files data/ldc93s1/ldc93s1.csv --train_batch_size 1 --test_batch_size 1 --n_hidden 100 --epochs 200 --log_level 0  
2019-05-14 11:58:21.309114: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-14 11:58:21.424454: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-14 11:58:21.425146: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x33890f0 executing computations on platform CUDA. Devices:
2019-05-14 11:58:21.425163: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2019-05-14 11:58:21.427067: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3491870000 Hz
2019-05-14 11:58:21.427566: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3c633e0 executing computations on platform Host. Devices:
2019-05-14 11:58:21.427587: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-05-14 11:58:21.427976: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:02:00.0
totalMemory: 7.76GiB freeMemory: 7.39GiB
2019-05-14 11:58:21.427994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-14 11:58:21.428764: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-14 11:58:21.428779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-05-14 11:58:21.428786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-05-14 11:58:21.429141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 7185 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:02:00.0, compute capability: 7.5)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use
    tf.py_function, which takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py:358: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/rnn/python/ops/lstm_ops.py:696: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
2019-05-14 11:58:22.280196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-14 11:58:22.280233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-14 11:58:22.280241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-05-14 11:58:22.280247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-05-14 11:58:22.280595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7185 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:02:00.0, compute capability: 7.5)
D Session opened.
I Initializing variables...
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000                                                                                                                                   2019-05-14 11:58:23.028141: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-05-14 11:58:24.275096: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-05-14 11:58:24.289759: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node tower_0/conv1d/Conv2D}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 829, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "DeepSpeech.py", line 813, in main
    train()
  File "DeepSpeech.py", line 510, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "DeepSpeech.py", line 483, in run_set
    feed_dict=feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]

Caused by op 'tower_0/conv1d/Conv2D', defined at:
  File "DeepSpeech.py", line 829, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "DeepSpeech.py", line 813, in main
    train()
  File "DeepSpeech.py", line 400, in train
    gradients, loss = get_tower_results(iterator, optimizer, dropout_rates)
  File "DeepSpeech.py", line 253, in get_tower_results
    avg_loss = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "DeepSpeech.py", line 186, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse)
  File "DeepSpeech.py", line 119, in create_model
    batch_x = create_overlapping_windows(batch_x)
  File "DeepSpeech.py", line 56, in create_overlapping_windows
    batch_x = tf.nn.conv1d(batch_x, eye_filter, stride=1, padding='SAME')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 574, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 574, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 3482, in conv1d
    data_format=data_format)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 1026, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node tower_0/conv1d/Conv2D (defined at DeepSpeech.py:56) ]]

root@953d2eb1cfea:/DeepSpeech-root/DeepSpeech#    

Can this issue be resolved? Any help is appreciated. Thanks.

It looks like your cuDNN setup may be incorrect. This is not really a DeepSpeech issue; it works well here on an RTX 2080 Ti.

Okay, I will verify. Thanks.

That is a problem with your CUDA version; I had the same problem when I upgraded to v0.5 with TF 1.13. Are you sure that you are using CUDA 10.0 and not CUDA 10.1?
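
One way to double-check which CUDA runtime and cuDNN versions actually get loaded is to query the libraries directly. A minimal sketch, assuming libcudart.so.10.0 and libcudnn.so.7 are on the loader path (cudaRuntimeGetVersion and cudnnGetVersion are part of the public CUDA/cuDNN C API):

    # print the CUDA runtime and cuDNN versions seen by the dynamic loader
    import ctypes

    cudart = ctypes.CDLL("libcudart.so.10.0")        # soname of the CUDA 10.0 runtime
    ver = ctypes.c_int()
    cudart.cudaRuntimeGetVersion(ctypes.byref(ver))
    print("CUDA runtime:", ver.value)                # 10000 = 10.0, 10010 = 10.1

    cudnn = ctypes.CDLL("libcudnn.so.7")
    cudnn.cudnnGetVersion.restype = ctypes.c_size_t
    print("cuDNN:", cudnn.cudnnGetVersion())         # e.g. 7501 = 7.5.1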

I have exactly the same problem…
I have CUDA 10.0 and cuDNN 7.5.1 installed, and the GPU seems to run out of memory.

Any updates or solutions?

No updates or solutions. I verified a proper setup of CUDA 10 and cuDNN 7.5. I can train with other cards.

If you are experiencing OOM on the GPU, please reduce the batch size.
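For DeepSpeech that means lowering the --train_batch_size and --test_batch_size flags passed to DeepSpeech.py.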

What do you mean by "can train with other cards"?

Anyway, the reported error does not come from DeepSpeech code; at best it would be an upstream TensorFlow issue, and there is nothing we can help with here.

The error explicitly states a failure to initialize cuDNN.