Hi,
I have been attempting multi-process inference on a multi-GPU setup and have run into some issues; I was hoping to get some advice on how to solve them. First, let me clarify that I am working with DeepSpeech 0.6.1 on Python 3.6.9.
Following the advice on this thread: Running multiple inferences in parallel on a GPU
I managed to run inference in parallel on a single GPU without TensorFlow taking up the whole GPU memory for each inference process.
However, when I follow the same instructions and rebuild DeepSpeech on a multi-GPU setup, allow_growth no longer seems to work: all of the GPU memory gets taken up and TensorFlow raises out-of-memory errors.
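(To clarify what I mean by allow_growth: it is the TensorFlow GPU option from the linked thread, which in plain TensorFlow 1.x Python would be set roughly as below. In my case it is baked into the rebuilt libdeepspeech rather than set from Python, so this snippet is only for illustration.)

import tensorflow as tf

# Rough illustration only: with allow_growth the session allocates GPU memory
# on demand instead of reserving almost all of it up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)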
Below is an image of what watch nvidia-smi shows. This is with only 1 process running (so no parallelising has happened yet, but it is running as a separate process from the main one):
My code consists of two files. The main script is as follows:
import torch
import torch.multiprocessing as tmp
from transcription_gpu import run_transcription
import argparse
import time

PROCESS_NUM = 2

if __name__ == '__main__':
    before = time.time()
    # Set up a number of processes
    processes = [tmp.Process(target=run_transcription, args=()) for x in range(1, PROCESS_NUM+1)]
    # Run processes
    for p in processes:
        print("I am a process!")
        p.start()
    # Exit the completed processes
    for p in processes:
        p.join()
    after = time.time()
    print("Processing Time: ", after-before)
And transcription_gpu.py, which each process runs, is:

import scipy.io.wavfile as wav
import sys
import os
import time

from deepspeech import Model


def run_transcription():
    BEAM_WIDTH = 500
    LM_WEIGHT = 1.50
    VALID_WORD_COUNT_WEIGHT = 2.10
    N_FEATURES = 26
    N_CONTEXT = 9
    MODEL_ROOT_DIR = 'models/deepspeech-0.6.1-models/'

    # Load the acoustic model and enable the external language model (0.6.1 API)
    ds = Model(
        MODEL_ROOT_DIR + 'output_graph.pb',
        BEAM_WIDTH)
    ds.enableDecoderWithLM(
        MODEL_ROOT_DIR + 'lm.binary',
        MODEL_ROOT_DIR + 'trie',
        LM_WEIGHT,
        VALID_WORD_COUNT_WEIGHT)

    before = time.time()
    fs, audio = wav.read("audio/test.wav")
    transcript = ds.stt(audio)
    print(transcript)
    after = time.time()
    print("Transcription Time: ", after-before)