Problems Training on RTX3080

lukszi · October 12, 2020, 7:09am

Hi, I’m running Ubuntu 18.04 with an Nvidia RTX 3080. I checked out the v0.8.2 tag on GitHub and only modified alphabet.txt to accommodate the German Common Voice dataset.
I’m getting a rather long error message when trying to run this command:

./DeepSpeech.py --train_files ./data/CV/de/clips/train.csv --dev_files ./data/CV/de/clips/dev.csv --test_files ./data/CV/de/clips/test.csv --log_level 0

I believe the error message boils down to this line:

2020-10-12 01:10:20.357132: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal : Value 'sm_86' is not defined for option 'gpu-name'

The entire log looks like this:

2020-10-12 00:55:02.977870: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-10-12 00:55:02.999112: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3999980000 Hz
2020-10-12 00:55:02.999355: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5b8b4a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-12 00:55:02.999366: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-10-12 00:55:03.000739: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-10-12 00:55:03.075007: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-12 00:55:03.075414: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5c25230 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-10-12 00:55:03.075430: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce RTX 3080, Compute Capability 8.6
2020-10-12 00:55:03.075539: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-12 00:55:03.075880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: GeForce RTX 3080 major: 8 minor: 6 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2020-10-12 00:55:03.076083: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-10-12 00:55:03.076901: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-10-12 00:55:03.077578: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-10-12 00:55:03.077745: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-10-12 00:55:03.078661: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-10-12 00:55:03.079365: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-10-12 00:55:03.081532: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-12 00:55:03.081602: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-12 00:55:03.081934: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-12 00:55:03.082203: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2020-10-12 00:55:03.082227: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-10-12 00:55:03.082778: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-12 00:55:03.082787: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186] 0
2020-10-12 00:55:03.082792: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0: N
2020-10-12 00:55:03.082860: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-12 00:55:03.083161: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-12 00:55:03.083446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:0 with 8801 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3080, pci bus id: 0000:01:00.0, compute capability: 8.6)
2020-10-12 00:55:03.753431: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-12 00:55:03.753746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: GeForce RTX 3080 major: 8 minor: 6 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2020-10-12 00:55:03.753775: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-10-12 00:55:03.753783: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-10-12 00:55:03.753792: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-10-12 00:55:03.753800: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-10-12 00:55:03.753808: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-10-12 00:55:03.753815: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-10-12 00:55:03.753822: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-12 00:55:03.753859: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-12 00:55:03.754153: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-12 00:55:03.754419: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
WARNING:tensorflow:From /home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:347: Iterator.output_types (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.data.get_output_types(iterator).
W1012 00:55:03.927856 139958901720896 deprecation.py:323] From /home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:347: Iterator.output_types (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.data.get_output_types(iterator).
WARNING:tensorflow:From /home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:348: Iterator.output_shapes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.data.get_output_shapes(iterator).
W1012 00:55:03.928005 139958901720896 deprecation.py:323] From /home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:348: Iterator.output_shapes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.data.get_output_shapes(iterator).
WARNING:tensorflow:From /home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:350: Iterator.output_classes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.data.get_output_classes(iterator).
W1012 00:55:03.928088 139958901720896 deprecation.py:323] From /home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:350: Iterator.output_classes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.data.get_output_classes(iterator).
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
W1012 00:55:04.018131 139958901720896 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/rnn/python/ops/lstm_ops.py:597: Layer.add_variable (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.add_weight` method instead.
W1012 00:55:04.019410 139958901720896 deprecation.py:323] From /home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/rnn/python/ops/lstm_ops.py:597: Layer.add_variable (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.add_weight` method instead.
WARNING:tensorflow:From /home/lukas/DeepSpeech/training/deepspeech_training/train.py:245: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W1012 00:55:04.068966 139958901720896 deprecation.py:323] From /home/lukas/DeepSpeech/training/deepspeech_training/train.py:245: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
2020-10-12 00:55:04.431726: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-12 00:55:04.432038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: 
name: GeForce RTX 3080 major: 8 minor: 6 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2020-10-12 00:55:04.432065: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-10-12 00:55:04.432074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-10-12 00:55:04.432082: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-10-12 00:55:04.432090: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-10-12 00:55:04.432098: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-10-12 00:55:04.432106: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-10-12 00:55:04.432114: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-12 00:55:04.432152: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-12 00:55:04.432444: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-12 00:55:04.432710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2020-10-12 00:55:04.432726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-12 00:55:04.432732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0 
2020-10-12 00:55:04.432735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N 
2020-10-12 00:55:04.432781: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-12 00:55:04.433072: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-12 00:55:04.433344: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8801 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3080, pci bus id: 0000:01:00.0, compute capability: 8.6)
D Session opened.
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
2020-10-12 00:55:15.773105: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-10-12 00:56:27.844114: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-12 01:10:20.357132: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal   : Value 'sm_86' is not defined for option 'gpu-name'

Relying on driver to perform ptx compilation. This message will be only logged once.
2020-10-12 01:10:21.064574: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(63, 494), b.shape=(494, 2048), m=63, n=2048, k=494
	 [[{{node tower_0/MatMul}}]]
	 [[concat/concat/_99]]
  (1) Internal: Blas GEMM launch failed : a.shape=(63, 494), b.shape=(494, 2048), m=63, n=2048, k=494
	 [[{{node tower_0/MatMul}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/home/lukas/DeepSpeech/training/deepspeech_training/train.py", line 961, in run_script
    absl.app.run(main)
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/lukas/DeepSpeech/training/deepspeech_training/train.py", line 933, in main
    train()
  File "/home/lukas/DeepSpeech/training/deepspeech_training/train.py", line 601, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "/home/lukas/DeepSpeech/training/deepspeech_training/train.py", line 566, in run_set
    feed_dict=feed_dict)
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(63, 494), b.shape=(494, 2048), m=63, n=2048, k=494
	 [[node tower_0/MatMul (defined at /home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[concat/concat/_99]]
  (1) Internal: Blas GEMM launch failed : a.shape=(63, 494), b.shape=(494, 2048), m=63, n=2048, k=494
	 [[node tower_0/MatMul (defined at /home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'tower_0/MatMul':
  File "./DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/home/lukas/DeepSpeech/training/deepspeech_training/train.py", line 961, in run_script
    absl.app.run(main)
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/lukas/DeepSpeech/training/deepspeech_training/train.py", line 933, in main
    train()
  File "/home/lukas/DeepSpeech/training/deepspeech_training/train.py", line 479, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/home/lukas/DeepSpeech/training/deepspeech_training/train.py", line 312, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "/home/lukas/DeepSpeech/training/deepspeech_training/train.py", line 239, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
  File "/home/lukas/DeepSpeech/training/deepspeech_training/train.py", line 180, in create_model
    layers['layer_1'] = layer_1 = dense('layer_1', batch_x, Config.n_hidden_1, dropout_rate=dropout[0])
  File "/home/lukas/DeepSpeech/training/deepspeech_training/train.py", line 82, in dense
    output = tf.nn.bias_add(tf.matmul(x, weights), bias)
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/math_ops.py", line 2754, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 6136, in mat_mul
    name=name)
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/lukas/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

apt list --installed | grep cuda
yields the following list:

cuda-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [installiert]
cuda-command-line-tools-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-compiler-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-cublas-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-cublas-dev-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-cudart-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-cudart-dev-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-cufft-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-cufft-dev-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-cuobjdump-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-cupti-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-curand-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-curand-dev-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-cusolver-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-cusolver-dev-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-cusparse-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-cusparse-dev-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-demo-suite-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-documentation-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-driver-dev-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-drivers/unbekannt,unbekannt,now 455.23.05-1 amd64 [Installiert,automatisch]
cuda-drivers-455/unbekannt,unbekannt,now 455.23.05-1 amd64 [Installiert,automatisch]
cuda-gdb-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-gpu-library-advisor-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-libraries-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-libraries-dev-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-license-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-memcheck-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-misc-headers-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-npp-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-npp-dev-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-nsight-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-nsight-compute-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-nvcc-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-nvdisasm-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-nvgraph-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-nvgraph-dev-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-nvjpeg-10-0/unbekannt,now 10.0.130.1-1 amd64 [Installiert,automatisch]
cuda-nvjpeg-dev-10-0/unbekannt,now 10.0.130.1-1 amd64 [Installiert,automatisch]
cuda-nvml-dev-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-nvprof-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-nvprune-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-nvrtc-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-nvrtc-dev-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-nvtx-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-nvvp-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-repo-ubuntu1804-10-0-local-10.0.130-410.48/now 1.0-1 amd64 [Installiert,lokal]
cuda-runtime-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-samples-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-toolkit-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-tools-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
cuda-visual-tools-10-0/unbekannt,unbekannt,now 10.0.130-1 amd64 [Installiert,automatisch]
libcudart9.1/bionic,now 9.1.85-3ubuntu1 amd64 [Installiert,automatisch]
libcudnn7/unbekannt,now 7.6.4.38-1+cuda10.0 amd64 [Installiert,aktualisierbar auf: 7.6.5.32-1+cuda10.2]
nvidia-cuda-dev/bionic,now 9.1.85-3ubuntu1 amd64 [Installiert,automatisch]
nvidia-cuda-doc/bionic,bionic,now 9.1.85-3ubuntu1 all [Installiert,automatisch]
nvidia-cuda-gdb/bionic,now 9.1.85-3ubuntu1 amd64 [Installiert,automatisch]
nvidia-cuda-toolkit/bionic,now 9.1.85-3ubuntu1 amd64 [installiert]

nvidia-smi results in this:

Mon Oct 12 09:04:21 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3080    On   | 00000000:01:00.0  On |                  N/A |
|  0%   50C    P5    46W / 320W |    519MiB / 10014MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1247      G   /usr/lib/xorg/Xorg                 34MiB |
|    0   N/A  N/A      1303      G   /usr/bin/gnome-shell               78MiB |
|    0   N/A  N/A      1862      G   /usr/lib/xorg/Xorg                254MiB |
|    0   N/A  N/A      2010      G   /usr/bin/gnome-shell               48MiB |
|    0   N/A  N/A      2603      G   /usr/lib/firefox/firefox            4MiB |
|    0   N/A  N/A      2712      G   /usr/lib/firefox/firefox            4MiB |
|    0   N/A  N/A      2926      G   …token=7611235723361034942         21MiB |
|    0   N/A  N/A      3395      G   …/debug.log --shared-files         15MiB |
|    0   N/A  N/A      3702      G   /usr/lib/firefox/firefox            4MiB |
|    0   N/A  N/A      3742      G   /usr/lib/firefox/firefox            4MiB |
|    0   N/A  N/A     13105      G   gnome-control-center                4MiB |
|    0   N/A  N/A     20096      G   /usr/bin/vlc                       35MiB |
+-----------------------------------------------------------------------------+

GPU memory usage shoots up to 9 GB and stays there for a few minutes, while GPU utilization stays below 10%. The training then crashes. I believe this happens because the older CUDA version required by TensorFlow 1.15 cannot work with the compute capability of the 3080.
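
The hypothesis can be checked against the log itself. Below is a minimal sketch, not part of DeepSpeech or TensorFlow: it pulls the `sm_XX` target out of the ptxas message and compares it with the newest architecture each CUDA toolkit release can compile for. The table `MAX_SM_BY_CUDA` and both function names are illustrative assumptions, and the table covers only the releases relevant to this thread.

```python
import re

# Newest GPU architecture (as an sm_XX number) that each toolkit's ptxas
# can target. Illustrative table covering only the releases discussed here.
MAX_SM_BY_CUDA = {
    "10.0": 75,  # Turing
    "11.0": 80,  # Ampere (A100)
    "11.1": 86,  # Ampere (RTX 30 series)
}


def requested_sm(log_line):
    """Return the sm_XX number named in a ptxas 'gpu-name' error, or None."""
    match = re.search(r"sm_(\d+)", log_line)
    return int(match.group(1)) if match else None


def toolkit_supports(cuda_version, sm):
    """True if the given toolkit release can compile for architecture sm."""
    return sm <= MAX_SM_BY_CUDA[cuda_version]


# The ptxas message copied from the log above.
line = "ptxas fatal   : Value 'sm_86' is not defined for option 'gpu-name'"
sm = requested_sm(line)
print(sm, toolkit_supports("10.0", sm), toolkit_supports("11.1", sm))
```

For the ptxas line from this log, the sketch reports that CUDA 10.0 cannot target sm_86 while CUDA 11.1 can, which is consistent with the hypothesis.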

lissyx · October 12, 2020, 7:24am

Which, as you can understand, is outside the scope of what we can help with here.

lukszi · October 12, 2020, 8:19am

I wasn’t looking for help as much as a confirmation of my hypothesis.

carlfm01 · October 13, 2020, 1:11pm

I’m only able to train on the 3000 series with the Nvidia container or with their TF build: https://github.com/NVIDIA/tensorflow#install

For me, the 3090 only works with CUDA 11.1 and cuDNN 8+ using Nvidia’s TF 1.15.

lukszi · October 14, 2020, 6:32am

Hey, thank you for your response.

Did you get it to run with the container or the Wheel Index?
If you did get it to run with the wheel index, would you be so kind as to send me the output of your pip freeze and the output of a DeepSpeech training run with the argument --log_level 0, so that I can see which library versions TensorFlow uses?

For me, TensorFlow ignores the manually installed libraries and instead uses the libraries that pip installed, which leads to another crash.
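
One way to see which library versions TensorFlow actually picked up is to collect the `dso_loader` lines from a `--log_level 0` run. The following is a small sketch under the assumption that the log has been captured as text; the names `DSO_PATTERN` and `loaded_libraries` are made up for this example, and the sample lines are copied from the log earlier in this thread.

```python
import re

# Matches TensorFlow's dso_loader messages, e.g.
# "... Successfully opened dynamic library libcudart.so.10.0"
DSO_PATTERN = re.compile(r"Successfully opened dynamic library (\S+)")


def loaded_libraries(log_text):
    """Return the sorted, de-duplicated library names TensorFlow reported loading."""
    return sorted(set(DSO_PATTERN.findall(log_text)))


sample_log = """\
2020-10-12 00:55:03.000739: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-10-12 00:55:03.076083: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-10-12 00:55:03.081532: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-12 00:55:03.082227: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
"""

for name in loaded_libraries(sample_log):
    print(name)
```

A `.so.10.0` suffix on the CUDA libraries and `.so.7` on cuDNN, as in this sample, would indicate that the CUDA 10.0 stack is still the one being loaded.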

angga.fsahid · November 25, 2020, 3:08am

Could you please give us more detailed information about that?
A lot of people run into problems using TF 1.15 with the RTX 30 series.

othiele · November 25, 2020, 9:22am

Lots of people? They are sold out here

But you are right, more and more people will be training on the 30 series. Still, @carlfm01 said to use their TF. What part of that is problematic for you? Try it and post in a new thread what you can’t solve. We are happy to help, but a general “I can’t” is hard to solve.

allen7575 · December 4, 2020, 1:01pm

As @carlfm01 says, NVIDIA maintains its own version of TensorFlow 1.15 here: https://github.com/NVIDIA/tensorflow#install , which supports the latest GPUs.

So you need to remove the official TensorFlow that was installed through pip or conda and install NVIDIA’s version, as its README.md says:

Install the NVIDIA wheel index:

$ pip install --user nvidia-pyindex

Install the current NVIDIA TensorFlow release:

$ pip install --user nvidia-tensorflow[horovod]

Once it is installed, just use it like regular TensorFlow:

import tensorflow as tf
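
Because both builds provide the same `tensorflow` import, a leftover stock install can shadow NVIDIA’s one. As a rough sketch (not something from the NVIDIA README), the check below classifies a list of installed distribution names. `STOCK_NAMES`, `tensorflow_status`, and `installed_distributions` are hypothetical names, and the helper that reads the environment assumes Python 3.8+ for `importlib.metadata`.

```python
# Distribution names that provide the `tensorflow` import package.
# Illustrative constants for this sketch, not an official list.
STOCK_NAMES = {"tensorflow", "tensorflow-gpu"}
NVIDIA_NAME = "nvidia-tensorflow"


def tensorflow_status(installed_names):
    """Classify an environment by which TensorFlow distributions it contains."""
    names = {n.lower().replace("_", "-") for n in installed_names}
    stock = sorted(names & STOCK_NAMES)
    has_nvidia = NVIDIA_NAME in names
    if has_nvidia and stock:
        return "conflict: uninstall " + ", ".join(stock)
    if has_nvidia:
        return "ok: only nvidia-tensorflow is installed"
    if stock:
        return "stock build only: " + ", ".join(stock)
    return "no tensorflow distribution found"


def installed_distributions():
    """List installed distribution names (assumes Python 3.8+)."""
    from importlib.metadata import distributions
    return [d.metadata["Name"] for d in distributions() if d.metadata["Name"]]


print(tensorflow_status(["numpy", "tensorflow-gpu", "nvidia-tensorflow"]))
print(tensorflow_status(["numpy", "nvidia-tensorflow"]))
```

To check the current environment instead of a hand-written list, call `tensorflow_status(installed_distributions())`.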

lissyx · December 4, 2020, 1:35pm

I guess it could deserve a small doc PR for both master and r0.9 in the training section?

Denny_Chen · January 18, 2021, 11:40am

TF 1.15 requires libcudnn.so.7, libcudart.so.10.0.

carlfm01 · January 19, 2021, 7:36am

That applies to the Google version. Nvidia’s version installs CUDA 11 and cuDNN 8, which are required to run Ampere GPUs.

hiyassat · October 9, 2021, 4:22am

I just installed an RTX 3090 on Ubuntu 20. What you need is what was said above:
3090 for me only works with Cuda 11.1 and cudnn 8+ using the Nvidia’s TF 1.15
To install TF 1.15:
pip install --user nvidia-pyindex

pip install --user nvidia-tensorflow[horovod] --use-feature=2020-resolver
After that it works fine.

Emanuel_zamorano · November 9, 2023, 5:09am

Hello, there, what about windows? I see that the prerequisites are for Ubuntu and not Windows, but I am trying to train on Windows 10 and am having a very similar issue. Can you help me with this? Maybe we can chat on a different platform for faster communication?

Emanuel_zamorano · November 9, 2023, 5:10am

My friend, what about windows? I see that the prerequisites are for Ubuntu, but I am trying to train on Windows 10. Would you be so kind as to help me?

lissyx · November 9, 2023, 8:07am

Windows was never supported for training. Now that the project is defunct, it’s not going to improve.

Emanuel_zamorano · November 9, 2023, 7:04pm

Yes, that is exactly the conclusion I was coming to after weeks of trying to train it on Windows with a GPU. I felt defeated. Interestingly enough, before trying to train with my RTX 3090 on my Windows PC, I was able to train using the CPU instead of the GPU. So is my best option to just use my CPU on Windows? If not, what is your recommendation?