How do I make DeepSpeech run on a single (idle) GPU only?

With the command below, it runs on all available GPUs (two) and throws an OOM error, since one of them is fully occupied by another program.

So how can I instruct DeepSpeech to run ONLY on the specified GPU, not both?

nohup python -u DeepSpeech.py \
                   --alphabet_config_path=data/alphabet.txt \
                   --train_files=data/tbwavdata1/tbwavdata1_train.csv \
                   --dev_files=data/tbwavdata1/tbwavdata1_dev.csv \
                   --test_files=data/tbwavdata1/tbwavdata1_test.csv \
                   --train_batch_size=10 \
                   --test_batch_size=10 \
                   --dev_batch_size=10 \
                   --validation_step=1 \
                   --log_level=0 \
                   --n_hidden=512 \
                   --checkpoint_dir=checkpoints/ \
                   --epoch=20 > tbwavdata1_0711.log &

You don’t need to do anything specific to DeepSpeech; just rely on CUDA’s CUDA_VISIBLE_DEVICES environment variable.
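
For example, here is a minimal sketch, assuming GPU 0 is the idle card (pick whichever index nvidia-smi reports as free) and reusing the flags from your first post:

# Expose only GPU 0 to TensorFlow; all other flags stay as before.
CUDA_VISIBLE_DEVICES=0 nohup python -u DeepSpeech.py \
                   --alphabet_config_path=data/alphabet.txt \
                   --train_files=data/tbwavdata1/tbwavdata1_train.csv \
                   --dev_files=data/tbwavdata1/tbwavdata1_dev.csv \
                   --test_files=data/tbwavdata1/tbwavdata1_test.csv \
                   --train_batch_size=10 \
                   --test_batch_size=10 \
                   --dev_batch_size=10 \
                   --validation_step=1 \
                   --log_level=0 \
                   --n_hidden=512 \
                   --checkpoint_dir=checkpoints/ \
                   --epoch=20 > tbwavdata1_0711.log &

You can also run "export CUDA_VISIBLE_DEVICES=0" once for the whole shell session instead of prefixing each command.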

Yes, I did, and it is not working.
See below:

$ nvidia-smi
Tue Jul 10 17:48:27 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           On   | 0000:06:00.0     Off |                    0 |
| N/A   28C    P8    17W / 250W |      0MiB / 11443MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40           On   | 0000:87:00.0     Off |                    0 |
| N/A   45C    P0   101W / 250W |  10947MiB / 11443MiB |     35%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1     81951    C   /home/nanhai.ynh/.local/bin/python3.6       10945MiB  |
+-----------------------------------------------------------------------------+

You see, one GPU card is busy.
And when I run DeepSpeech, it tries to run on both of them, which causes the OOM (note the tower_1 ops being placed on /device:GPU:1 in the log below).

D Starting coordinator…
D Coordinator started.
0
0
0
D Starting queue runners…
D Queue runners started.

WARNING: libdeepspeech failed to load, resorting to deprecated code
Refer to README.md for instructions on installing libdeepspeech

E OOM when allocating tensor with shape[78,10,512]
E   [[Node: tower_1/bidirectional_rnn/bw/bw/TensorArrayStack/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:@tower_1/bidirectional_rnn/bw/bw/TensorArray"], dtype=DT_FLOAT, element_shape=[?,512], _device="/job:localhost/replica:0/task:0/device:GPU:1"](tower_1/bidirectional_rnn/bw/bw/TensorArray, tower_1/bidirectional_rnn/bw/bw/TensorArrayStack/range, tower_1/bidirectional_rnn/bw/bw/while/Exit_1)]]
E   [[Node: tower_1/gradients/tower_1/MatMul_grad/tuple/control_dependency_1/_697 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_4787_tower_1/gradients/tower_1/MatMul_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
E
E Caused by op u'tower_1/bidirectional_rnn/bw/bw/TensorArrayStack/TensorArrayGatherV3', defined at:
E   File "DeepSpeech.py", line 1892, in <module>
E     tf.app.run()
E   File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 133, in run
E     _sys.exit(main(argv))
E   File "DeepSpeech.py", line 1849, in main
E     train()
E   File "DeepSpeech.py", line 1555, in train
E     results_tuple, gradients, mean_edit_distance, loss = get_tower_results(model_feeder, optimizer)
E   File "DeepSpeech.py", line 642, in get_tower_results
E     calculate_mean_edit_distance_and_loss(model_feeder, i, no_dropout if optimizer is None else dropout_rates)
E   File "DeepSpeech.py", line 523, in calculate_mean_edit_distance_and_loss
E     logits = BiRNN(batch_x, tf.to_int64(batch_seq_len), dropout)
E   File "DeepSpeech.py", line 460, in BiRNN
E     sequence_length=seq_length)
E   File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 428, in bidirectional_dynamic_rnn
E     time_major=time_major, scope=bw_scope)
E   File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 616, in dynamic_rnn
E     dtype=dtype)
E   File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 795, in _dynamic_rnn_loop
E     final_outputs = tuple(ta.stack() for ta in output_final_ta)
E   File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/ops/rnn.py", line 795, in <genexpr>
E     final_outputs = tuple(ta.stack() for ta in output_final_ta)
E   File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/ops/tensor_array_ops.py", line 889, in stack
E     return self._implementation.stack(name=name)
E   File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/ops/tensor_array_ops.py", line 288, in stack
E     return self.gather(math_ops.range(0, self.size()), name=name)
E   File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/ops/tensor_array_ops.py", line 302, in gather
E     element_shape=element_shape)
E   File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 4158, in _tensor_array_gather_v3
E     flow_in=flow_in, dtype=dtype, element_shape=element_shape, name=name)
E   File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
E     op_def=op_def)
E   File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3081, in create_op
E     op_def=op_def)
E   File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1528, in __init__
E     self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
E
E ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[78,10,512]
E   [[Node: tower_1/bidirectional_rnn/bw/bw/TensorArrayStack/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:@tower_1/bidirectional_rnn/bw/bw/TensorArray"], dtype=DT_FLOAT, element_shape=[?,512], _device="/job:localhost/replica:0/task:0/device:GPU:1"](tower_1/bidirectional_rnn/bw/bw/TensorArray, tower_1/bidirectional_rnn/bw/bw/TensorArrayStack/range, tower_1/bidirectional_rnn/bw/bw/while/Exit_1)]]
E   [[Node: tower_1/gradients/tower_1/MatMul_grad/tuple/control_dependency_1/_697 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_4787_tower_1/gradients/tower_1/MatMul_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
E
Traceback (most recent call last):
  File "DeepSpeech.py", line 1649, in train
    step = session.run(global_step, feed_dict=feed_dict)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 524, in run
    run_metadata=run_metadata)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 996, in run
    run_metadata=run_metadata)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1087, in run
    raise six.reraise(*original_exc_info)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1072, in run
    return self._sess.run(*args, **kwargs)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1144, in run
    run_metadata=run_metadata)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 924, in run
    return self._sess.run(*args, **kwargs)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
ResourceExhaustedError: OOM when allocating tensor with shape[78,10,512]
  [[Node: tower_1/bidirectional_rnn/bw/bw/TensorArrayStack/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:@tower_1/bidirectional_rnn/bw/bw/TensorArray"], dtype=DT_FLOAT, element_shape=[?,512], _device="/job:localhost/replica:0/task:0/device:GPU:1"](tower_1/bidirectional_rnn/bw/bw/TensorArray, tower_1/bidirectional_rnn/bw/bw/TensorArrayStack/range, tower_1/bidirectional_rnn/bw/bw/while/Exit_1)]]
  [[Node: tower_1/gradients/tower_1/MatMul_grad/tuple/control_dependency_1/_697 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_4787_tower_1/gradients/tower_1/MatMul_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

So, how can I avoid this?

Thanks.

First, avoid using pictures: they are neither readable nor indexable. Second, you haven’t documented your command line, so there is no way to check what you did: the OOM could be triggered by something else.

I updated the post.
And I am pretty sure the OOM is triggered by the lack of memory on the occupied GPU,
since I have run it successfully many times before.

I’m not saying it’s not; I’m saying it might not be the only reason. But you still have not provided your command line.

Actually, I did; it was in my first post.
Here, I post it again:

nohup python -u DeepSpeech.py \
                   --alphabet_config_path=data/alphabet.txt \
                   --train_files=data/tbwavdata1/tbwavdata1_train.csv \
                   --dev_files=data/tbwavdata1/tbwavdata1_dev.csv \
                   --test_files=data/tbwavdata1/tbwavdata1_test.csv \
                   --train_batch_size=10 \
                   --test_batch_size=10 \
                   --dev_batch_size=10 \
                   --validation_step=1 \
                   --log_level=0 \
                   --n_hidden=512 \
                   --checkpoint_dir=checkpoints/ \
                   --epoch=20 > tbwavdata1_0711.log &

Well, your console output does not include any of the expected TensorFlow-level CUDA-related output, and this command line does not include any env var, so I still cannot check what you are doing...
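
For instance, something like this (a hypothetical check; run it in the same shell you train from) would show both the env var and the devices TensorFlow actually enumerates:

$ echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
$ python -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices()])"

If CUDA_VISIBLE_DEVICES=0 took effect, only /device:GPU:0 (plus the CPU) should appear in the list.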

How can I provide that environment info and the CUDA-related output?
Thank you for being so patient with me!

Just paste everything you type and everything that gets printed?
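
For example, a session capture might look like this (a hypothetical sketch; the GPU index and log name are placeholders):

$ export CUDA_VISIBLE_DEVICES=0
$ echo $CUDA_VISIBLE_DEVICES
0
$ python -u DeepSpeech.py --n_hidden=512 --epoch=20 > run.log 2>&1   # plus your other flags
$ head -n 40 run.log   # TensorFlow prints its GPU discovery messages near the start

Note the 2>&1: TensorFlow writes its CUDA/GPU discovery messages to stderr, so a plain "> run.log" (as in your nohup command) never captures them, which is why they are missing from your log.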