We are running into an issue trying to run multiple inferences in parallel on a GPU. Using torch multiprocessing we have made a script that creates a queue and runs 'n' processes. When setting 'n' greater than 2 we run into out-of-memory errors. From a bit of research on this Discourse we've figured out that this is because TensorFlow allocates all of the GPU memory to itself when it initialises the session.
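For context, the launcher is roughly the following (a simplified sketch, not our exact script; the worker body and the actual DeepSpeech inference call are placeholders):

```python
import torch.multiprocessing as mp

def worker(task_queue):
    # Placeholder worker: in our script each process loads the model
    # and pulls audio files off the queue to run inference on.
    while True:
        item = task_queue.get()
        if item is None:  # sentinel to stop the worker
            break
        # run_inference(item)  # stand-in for the actual DeepSpeech call

if __name__ == "__main__":
    n = 4  # number of parallel inference processes
    queue = mp.Queue()
    procs = [mp.Process(target=worker, args=(queue,)) for _ in range(n)]
    for p in procs:
        p.start()
    # ... enqueue work, then send one None per worker and join ...
```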
We know how to set the 'use_allow_growth' flag in flags.py, which as we understand it basically just changes the tf.ConfigProto() to add
config.gpu_options.allow_growth = True
but that seems to apply only to the training code and not to inference.
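For reference, the change we mean is essentially the following (a minimal sketch using the TF 1.x Python API, not the actual DeepSpeech code; the question is where the equivalent goes for inference):

```python
import tensorflow as tf

# Grow GPU memory usage on demand instead of reserving the whole
# device when the session is created.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap each process at a fixed fraction of the GPU:
# config.gpu_options.per_process_gpu_memory_fraction = 0.4

with tf.Session(config=config) as sess:
    # ... load the frozen graph and run inference here ...
    pass
```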
How and where can we alter the tf.ConfigProto() used for inference so that many parallel processes can share, and take full advantage of, the GPU memory?
(This is using v0.5.1 and the pre-trained model associated with it)
Thank you for the reply. Will we have to go through the steps of rebuilding the package for this to take effect, or will the change somehow be picked up just by running the Python native client scripts?
After rebuilding and trying to run two parallel processes, we notice that one of the processes still tries to allocate all of the available GPU memory, so we still hit the same out-of-memory error:
2019-11-28 10:53:55.315564: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-11-28 10:53:55.550252: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 13.69G (14699583744 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.551025: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 12.32G (13229624320 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.551784: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 11.09G (11906661376 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.552518: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 9.98G (10715995136 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.553244: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 8.98G (9644395520 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.553949: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 8.08G (8679955456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.554668: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 7.28G (7811959808 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.555398: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 6.55G (7030763520 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.556143: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 5.89G (6327687168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.556854: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 5.30G (5694918144 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.557579: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 4.77G (5125426176 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.558281: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 4.30G (4612883456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.559010: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 3.87G (4151595008 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.559719: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 3.48G (3736435456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.560427: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 3.13G (3362791936 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.561154: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.82G (3026512640 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.561890: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.54G (2723861248 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.562617: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.28G (2451474944 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.563371: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.05G (2206327296 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.564074: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.85G (1985694464 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.564774: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.66G (1787124992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.565476: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.50G (1608412416 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.566201: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.35G (1447571200 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.566917: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.21G (1302814208 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.567654: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.09G (1172532736 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.568357: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1006.39M (1055279616 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.569082: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 905.75M (949751808 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.569801: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 815.18M (854776576 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:55.570519: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 733.66M (769298944 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-11-28 10:53:57.587464: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-11-28 10:53:57.823504: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-11-28 10:53:57.853637: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.856427: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.858252: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.859887: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.861685: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.863990: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.864661: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.866457: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.868445: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.870251: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.995128: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.995181: W tensorflow/stream_executor/stream.cc:2130] attempting to perform BLAS operation using StreamExecutor without BLAS support
Error running session: Internal: Blas GEMM launch failed : a.shape=(16, 494), b.shape=(494, 2048), m=16, n=2048, k=494
[[{{node MatMul}}]]
[[{{node logits}}]]
2019-11-28 10:53:57.995617: I tensorflow/stream_executor/stream.cc:2079] [stream=0x12a00170,impl=0x772ad70] did not wait for [stream=0x772ac90,impl=0x772a920]
2019-11-28 10:53:57.995622: I tensorflow/stream_executor/stream.cc:2079] [stream=0x125d9ae0,impl=0x12a2d610] did not wait for [stream=0x772ac90,impl=0x772a920]
2019-11-28 10:53:57.995700: I tensorflow/stream_executor/stream.cc:5027] [stream=0x12a00170,impl=0x772ad70] did not memcpy host-to-device; source: 0x178cbb00
2019-11-28 10:53:57.995713: I tensorflow/stream_executor/stream.cc:5014] [stream=0x125d9ae0,impl=0x12a2d610] did not memcpy device-to-host; source: 0x7fa6de002500
2019-11-28 10:53:57.995741: F tensorflow/core/common_runtime/gpu/gpu_util.cc:339] CPU->GPU Memcpy failed
2019-11-28 10:53:57.997924: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2019-11-28 10:53:57.997954: W tensorflow/stream_executor/stream.cc:2130] attempting to perform BLAS operation using StreamExecutor without BLAS support
2019-11-28 10:53:57.997983: I tensorflow/stream_executor/stream.cc:2079] [stream=0x12153fc0,impl=0x12154060] did not wait for [stream=0x10e21be0,impl=0x68b1600]
2019-11-28 10:53:57.998011: I tensorflow/stream_executor/stream.cc:5014] [stream=0x12153fc0,impl=0x12154060] did not memcpy device-to-host; source: 0x7fd539457400
2019-11-28 10:53:57.998142: F tensorflow/core/common_runtime/gpu/gpu_util.cc:292] GPU->CPU Memcpy failed
Sorry for the huge block of error messages, but I thought it would be relevant. Would you have any insight as to why this is happening despite the rebuild with the changes to the TensorFlow config?
lissyx:
Your nvidia-smi output (please please please avoid screenshots) shows two python processes, and not a deepspeech one.
Can you explain what you are doing?
Specifically, how are you running your inference here?
So we altered the code and then followed the instructions here:
to rebuild and install the package into a virtual environment. Then, using torch multiprocessing, we initialised two processes with the target being the inference function described in my previous comment.
lissyx:
tfmodelstate.cc is for master; it did not exist in v0.5.1. Also, make sure you're reading the documentation for v0.5.1 as well. In the post above you linked to the master documentation. That could be why the change isn't working for you: you may have followed the wrong steps.
In the post above you linked to the master documentation.
That was just my fault from linking quickly; we did follow the README.md found in the v0.5.1 repo. If tfmodelstate.cc doesn't exist in the release version and only exists in the alpha, and the changes made to the deepspeech.cc file have not taken effect, is there any action you could suggest to either diagnose or solve the problem?
The changes made to deepspeech.cc should apply if you rebuild libdeepspeech.so as well as the Python package you’re using. If it is applied, but not doing what you expected, then that’s a TensorFlow bug, as there’s nothing we can do beyond setting the flag in the Session options.
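One quick way to check whether the flag actually took effect is to watch per-process GPU memory while the workers run. A rough helper, assuming nvidia-smi is on the PATH and accepts the query-compute-apps fields shown (if not, parsing the plain nvidia-smi output works just as well):

```python
import subprocess

def gpu_memory_per_process():
    # Ask nvidia-smi for per-process GPU memory usage as CSV lines.
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-compute-apps=pid,process_name,used_memory",
        "--format=csv,noheader",
    ])
    return out.decode().strip().splitlines()

if __name__ == "__main__":
    for line in gpu_memory_per_process():
        print(line)
```

With allow_growth in effect, each inference process should show a modest footprint rather than one process holding nearly the whole card.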
ERROR: /home/administrator/deepspeech/tensorflow/tensorflow/core/kernels/BUILD:4192:1: error while parsing .d file: /home/administrator/.cache/bazel/_bazel_administrator/f0ef65007c45462e3bb61b45513f09ae/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/core/kernels/_objs/softmax_op_gpu/softmax_op_gpu.cu.pic.d (No such file or directory)
nvcc fatal : Could not open output file ‘/tmp/tmpxft_000065a2_00000000’
INFO: Elapsed time: 34.680s, Critical Path: 25.55s
INFO: 130 processes: 130 local.
FAILED: Build did NOT complete successfully