Running multiple inferences in parallel on a GPU

We are running into an issue with trying to run multiple inferences in parallel on a GPU. By using torch multiprocessing we have made a script that creates a queue and run ‘n’ number of processes.

When setting ‘n’ to greater than 2 we run into errors to do with lack of memory, from a bit of research on the discourse we’ve figured out that this is due to tensorflow allocating all of the GPU memory to itself when it initialises the session.

We know how to alter the ‘use_allow_growth’ flag in the flags.py which as we understand is basically just adding changing the tf.ConfigProto() to add

config.gpu_options.allow_growth = True

but that seems to only apply to the training method and not the inference method.

How and where can we alter the tf.ConfigProto() to be able to utilise this tensorflow method in order to be able to take full advantage of the GPU memory with many multiple processes?

(This is using v0.5.1 and the pre-trained model associated with it)

Add something to the effect of: options.config.mutable_gpu_options().set_allow_growth(true);

1 Like

Thank you for the reply. Will we have to go through the steps of rebuilding the package for this to take effect? Or will this somehow be read just by running the python native client scripts?

You’ll need to rebuild, as this is C++ code.

Excellent thank you. We will try this

After rebuilding and trying to run 2 parallel processes we notice that one of the processes running still tries to allocate all the GPU memory available meaning we still run into the same out of memory error

image

    2019-11-28 10:53:55.315564: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
    2019-11-28 10:53:55.550252: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 13.69G (14699583744 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.551025: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 12.32G (13229624320 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.551784: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 11.09G (11906661376 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.552518: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 9.98G (10715995136 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.553244: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 8.98G (9644395520 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.553949: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 8.08G (8679955456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.554668: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 7.28G (7811959808 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.555398: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 6.55G (7030763520 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.556143: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 5.89G (6327687168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.556854: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 5.30G (5694918144 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.557579: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 4.77G (5125426176 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.558281: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 4.30G (4612883456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.559010: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 3.87G (4151595008 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.559719: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 3.48G (3736435456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.560427: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 3.13G (3362791936 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.561154: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.82G (3026512640 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.561890: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.54G (2723861248 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.562617: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.28G (2451474944 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.563371: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.05G (2206327296 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.564074: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.85G (1985694464 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.564774: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.66G (1787124992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.565476: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.50G (1608412416 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.566201: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.35G (1447571200 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.566917: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.21G (1302814208 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.567654: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.09G (1172532736 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.568357: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1006.39M (1055279616 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.569082: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 905.75M (949751808 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.569801: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 815.18M (854776576 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:55.570519: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 733.66M (769298944 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
    2019-11-28 10:53:57.587464: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
    2019-11-28 10:53:57.823504: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
    2019-11-28 10:53:57.853637: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
    2019-11-28 10:53:57.856427: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
    2019-11-28 10:53:57.858252: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
    2019-11-28 10:53:57.859887: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
    2019-11-28 10:53:57.861685: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
    2019-11-28 10:53:57.863990: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
    2019-11-28 10:53:57.864661: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
    2019-11-28 10:53:57.866457: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
    2019-11-28 10:53:57.868445: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
    2019-11-28 10:53:57.870251: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
    2019-11-28 10:53:57.995128: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
    2019-11-28 10:53:57.995181: W tensorflow/stream_executor/stream.cc:2130] attempting to perform BLAS operation using StreamExecutor without BLAS support
    Error running session: Internal: Blas GEMM launch failed : a.shape=(16, 494), b.shape=(494, 2048), m=16, n=2048, k=494
             [[{{node MatMul}}]]
             [[{{node logits}}]]
    2019-11-28 10:53:57.995617: I tensorflow/stream_executor/stream.cc:2079] [stream=0x12a00170,impl=0x772ad70] did not wait for [stream=0x772ac90,impl=0x772a920]
    2019-11-28 10:53:57.995622: I tensorflow/stream_executor/stream.cc:2079] [stream=0x125d9ae0,impl=0x12a2d610] did not wait for [stream=0x772ac90,impl=0x772a920]
    2019-11-28 10:53:57.995700: I tensorflow/stream_executor/stream.cc:5027] [stream=0x12a00170,impl=0x772ad70] did not memcpy host-to-device; source: 0x178cbb00
    2019-11-28 10:53:57.995713: I tensorflow/stream_executor/stream.cc:5014] [stream=0x125d9ae0,impl=0x12a2d610] did not memcpy device-to-host; source: 0x7fa6de002500
    2019-11-28 10:53:57.995741: F tensorflow/core/common_runtime/gpu/gpu_util.cc:339] CPU->GPU Memcpy failed
    2019-11-28 10:53:57.997924: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
    2019-11-28 10:53:57.997954: W tensorflow/stream_executor/stream.cc:2130] attempting to perform BLAS operation using StreamExecutor without BLAS support
    2019-11-28 10:53:57.997983: I tensorflow/stream_executor/stream.cc:2079] [stream=0x12153fc0,impl=0x12154060] did not wait for [stream=0x10e21be0,impl=0x68b1600]
    2019-11-28 10:53:57.998011: I tensorflow/stream_executor/stream.cc:5014] [stream=0x12153fc0,impl=0x12154060] did not memcpy device-to-host; source: 0x7fd539457400
    2019-11-28 10:53:57.998142: F tensorflow/core/common_runtime/gpu/gpu_util.cc:292] GPU->CPU Memcpy failed

Sorry for the huge block of error message but I thought it would be relevant. Would you have any insight as to why this would be happening despite the rebuild with the changes to the tensorflow config?

Your nvidia-smi output (please please please avoid screenshots) shows two python processes, and not a deepspeech one.

Can you explain what you are doing ?

Specifically, how are you runing your inference here ?

We’re running it as an imported package from a python script that does the multiprocessing.

In the multiprocessing script we import ‘Model’ and then initialise it in the following way:

ds = Model(
    MODEL_ROOT_DIR + "output_graph.pb",
    N_FEATURES,
    N_CONTEXT,
    MODEL_ROOT_DIR + "alphabet.txt",
    BEAM_WIDTH)

ds.enableDecoderWithLM(MODEL_ROOT_DIR + "alphabet.txt",
                        MODEL_ROOT_DIR + "lm.binary",
                        MODEL_ROOT_DIR + "trie",
                        LM_WEIGHT,
                        VALID_WORD_COUNT_WEIGHT)

and then use the following:

fs, audio = wav.read("audio/test_16kHz.wav")
transcript = ds.stt(audio, fs)

for the inference

Can you describe your steps from rebuilding with the changes applied to running ?

So we altered the code then followed the instructions here:

to rebuild and install the package to a virtual environment then using torch multiprocessing we initialised two processes with the target being the function described in my previous comment about inference

That’s the part that we need to verify for sure.

That’s the part that we need to verify for sure.

We’ve definitely verified that because we put a stdout in the file and it was printed on execution

According to the comment, I modified tfmodelstate.cc as follows and it works fine.
libdeepspeech.so can be used from multiple processes.
Thanks

SessionOptions options;
options.config.mutable_gpu_options()->set_allow_growth(true);

Could you provide a bit of clarity on where the tfmodelstate.cc file is?

tfmodelstate.cc is for master, it did not exist in v0.5.1. Also, make sure you’re reading the documentation for v0.5.1 as well. In the post above you linked to the master documentation. That could be why the change isn’t working for you, you followed the wrong steps maybe.

In the post above you linked to the master documentation.

That was just my fault with linking quickly we did follow the README.md found in the v0.5.1 repo. If it doesn’t exist in the release version and only exists in the alpha and the changes made to the deepspeech.cc file have not taken effect is there any action you could suggest to either diagnose or solve the problem?

The changes made to deepspeech.cc should apply if you rebuild libdeepspeech.so as well as the Python package you’re using. If it is applied, but not doing what you expected, then that’s a TensorFlow bug, as there’s nothing we can do beyond setting the flag in the Session options.