CUDA OOM with training

Looking through the discussions here and the issues on GitHub, I found several threads on OOM problems. Unfortunately, none of them helped.

I am using the latest DeepSpeech clone, tensorflow-gpu 1.1.4, and Ubuntu 18.04 on a rig with 4 GTX 1080Ti 12 GB cards, CUDA 10.2, and an Intel Xeon CPU E5-2650 v2 @ 2.60GHz with 64 GB RAM.

For testing purposes I use the “test.csv” set of Common Voice 2, after importing the sound files with import_cv2.py.

Using default settings, which AFAIK means a batch size of 1, I get the following error:

tensorflow/stream_executor/cuda/cuda_driver.cc:175] Check failed: err == cudaSuccess || err == cudaErrorInvalidValue Unexpected CUDA error: out of memory

And here is the terminal output after running DeepSpeech.py:

(nlp-ds) orchestrate@gpurig:~/projects/DeepSpeech$ ./DeepSpeech.py --train_files /home/orchestrate/projects/corpora/common_voice_2/clips/test.csv
W0726 11:20:57.210319 140243572524864 deprecation_wrapper.py:119] From /mnt/sdb/projects/DeepSpeech/util/config.py:60: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0726 11:20:57.518586 140243572524864 deprecation.py:323] From /home/orchestrate/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py:494: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two
options available in V2.
- tf.py_function takes a python function which manipulates tf eager
tensors instead of numpy arrays. It’s easy to convert a tf eager tensor to
an ndarray (just call tensor.numpy()) but having access to eager tensors
means tf.py_functions can use accelerators such as GPUs as well as
being differentiable using a gradient tape.
- tf.numpy_function maintains the semantics of the deprecated tf.py_func
(it is not differentiable, and manipulates numpy arrays). It drops the
stateful argument making all functions stateful.

W0726 11:20:57.625662 140243572524864 deprecation.py:323] From /home/orchestrate/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py:348: Iterator.output_types (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.data.get_output_types(iterator).
W0726 11:20:57.625896 140243572524864 deprecation.py:323] From /home/orchestrate/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py:349: Iterator.output_shapes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.data.get_output_shapes(iterator).
W0726 11:20:57.626062 140243572524864 deprecation.py:323] From /home/orchestrate/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py:351: Iterator.output_classes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.data.get_output_classes(iterator).
W0726 11:20:57.750851 140243572524864 deprecation.py:506] From /home/orchestrate/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0726 11:20:58.883209 140243572524864 deprecation.py:323] From /home/orchestrate/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_grad.py:1250: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
I Initializing variables…
2019-07-26 11:21:02.797034: F tensorflow/stream_executor/cuda/cuda_driver.cc:175] Check failed: err == cudaSuccess || err == cudaErrorInvalidValue Unexpected CUDA error: out of memory
Fatal Python error: Aborted

Thread 0x00007f8b86ffd700 (most recent call first):
File "/home/orchestrate/anaconda3/lib/python3.6/threading.py", line 295 in wait
File "/home/orchestrate/anaconda3/lib/python3.6/queue.py", line 164 in get
File "/home/orchestrate/anaconda3/lib/python3.6/site-packages/tensorflow/python/summary/writer/event_file_writer.py", line 159 in run
File "/home/orchestrate/anaconda3/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/orchestrate/anaconda3/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f8b877fe700 (most recent call first):
File "/home/orchestrate/anaconda3/lib/python3.6/threading.py", line 295 in wait
File "/home/orchestrate/anaconda3/lib/python3.6/queue.py", line 164 in get
File "/home/orchestrate/anaconda3/lib/python3.6/site-packages/tensorflow/python/summary/writer/event_file_writer.py", line 159 in run
File "/home/orchestrate/anaconda3/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/orchestrate/anaconda3/lib/python3.6/threading.py", line 884 in _bootstrap

Current thread 0x00007f8d00528740 (most recent call first):
File "/home/orchestrate/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429 in _call_tf_sessionrun
File "/home/orchestrate/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341 in _run_fn
File "/home/orchestrate/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356 in _do_call
File "/home/orchestrate/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350 in _do_run
File "/home/orchestrate/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1173 in _run
File "/home/orchestrate/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950 in run
File "./DeepSpeech.py", line 481 in train
File "./DeepSpeech.py", line 828 in main
File "/home/orchestrate/anaconda3/lib/python3.6/site-packages/absl/app.py", line 251 in _run_main
File "/home/orchestrate/anaconda3/lib/python3.6/site-packages/absl/app.py", line 300 in run
File "/home/orchestrate/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40 in run
File "./DeepSpeech.py", line 844 in <module>
Aborted (core dumped)

Even after creating a new conda environment and reinstalling all the DeepSpeech packages from scratch, following the instructions in the README, training fails with exactly the same error.

Any help or suggestion would be greatly appreciated.

I assume you mean 1.14.

TensorFlow 1.14 only supports CUDA 10.0.
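
To confirm which CUDA toolkit is actually on your PATH, something like the following may help. It is only a rough sketch, not part of DeepSpeech: the helper names are made up for illustration, and it assumes `nvcc` is installed and prints its usual "release X.Y" line. Note that `nvcc` reports the toolkit on PATH, which is not necessarily the set of libraries TensorFlow ends up loading.

```python
import re
import subprocess

# Assumption taken from the statement above: TensorFlow 1.14 expects CUDA 10.0.
REQUIRED_CUDA = (10, 0)


def parse_nvcc_release(text):
    """Return (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def installed_cuda_release():
    """Ask nvcc for the toolkit version; None if nvcc is missing or fails."""
    try:
        out = subprocess.run(["nvcc", "--version"], stdout=subprocess.PIPE,
                             universal_newlines=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(out)


if __name__ == "__main__":
    found = installed_cuda_release()
    if found is None:
        print("nvcc not found; cannot determine the CUDA toolkit version")
    elif found != REQUIRED_CUDA:
        print("CUDA %d.%d found, but TensorFlow 1.14 expects %d.%d"
              % (found + REQUIRED_CUDA))
    else:
        print("CUDA toolkit version matches")
```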

Don’t use Anaconda; use virtualenv and install packages with pip. The instructions in the README never mention Anaconda.

Also, make sure no other process is using your GPU memory. It could be an old crashed process. Check with nvidia-smi.
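
If you would rather script that check, here is a minimal sketch that wraps nvidia-smi and lists the processes currently holding GPU memory. The function names are hypothetical, and it assumes your nvidia-smi supports the `--query-compute-apps` CSV query with the `pid`, `process_name` and `used_memory` fields. Rows where the driver cannot report memory use are skipped.

```python
import subprocess

# CSV query: one line per compute process, no header, memory in MiB without units.
QUERY = ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader,nounits"]


def parse_compute_apps(csv_text):
    """Parse nvidia-smi CSV rows into (pid, process_name, used_mib) tuples."""
    rows = []
    for line in csv_text.splitlines():
        parts = line.split(",")
        if len(parts) < 3:
            continue  # blank or malformed line
        pid = parts[0].strip()
        used = parts[-1].strip()
        # Re-join the middle fields in case the process name contains a comma.
        name = ",".join(parts[1:-1]).strip()
        if not (pid.isdigit() and used.isdigit()):
            continue  # e.g. memory reported as not available
        rows.append((int(pid), name, int(used)))
    return rows


def gpu_processes():
    """Run nvidia-smi and return the parsed rows; empty list if it is unavailable."""
    try:
        out = subprocess.run(QUERY, stdout=subprocess.PIPE,
                             universal_newlines=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_compute_apps(out)


if __name__ == "__main__":
    for pid, name, used_mib in gpu_processes():
        print("pid %d (%s) is holding %d MiB of GPU memory" % (pid, name, used_mib))
```

Any stale Python process that shows up here can be stopped before you restart training.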


My bad. Switching from conda to virtualenv did the trick. Thank you!
