How to efficiently run concurrent inferences with a DeepSpeech model

Flask App Code

import io
import wave
from timeit import default_timer as timer

import numpy as np
from flask import Flask, request, jsonify

import model  # project helper assumed to wrap loading of the trained DeepSpeech model

app = Flask(__name__)
model = model.load_model()


@app.route('/api', methods=['POST'])
def predict():
    # Buffer the raw WAV body of the POST request so the wave module can seek in it.
    data = io.BytesIO(request.get_data())
    fin = wave.open(data, 'rb')
    fs = fin.getframerate()
    audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
    audio_length = fin.getnframes() / fs  # duration in seconds
    fin.close()

    inference_start = timer()
    # Note: DeepSpeech 0.6+ drops the sample-rate argument, i.e. model.stt(audio).
    solution = model.stt(audio, fs)
    inference_end = timer() - inference_start

    resp = {'solution': solution.strip(), 'error': False, 'time': inference_end}
    return jsonify(resp)


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5001)
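
For completeness, a client call against this endpoint could look roughly like the following; the file name, host and port are placeholders, and the request simply sends the raw WAV bytes as the POST body.

import requests

# Post a 16 kHz, 16-bit mono WAV file as the raw request body (path is a placeholder).
with open('sample.wav', 'rb') as f:
    resp = requests.post('http://localhost:5001/api', data=f.read())

print(resp.json())  # e.g. {'solution': '...', 'error': False, 'time': 0.8}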

I’m trying to deploy a trained DeepSpeech model with Flask. For the CPU version of DeepSpeech I can use uwsgi/nginx to create multiple workers that run inference in parallel. For the GPU version, however, the model is too large to load multiple times concurrently. As I am not very experienced with ML deployment, I was hoping someone could explain how to deploy a GPU DeepSpeech model so it can handle multiple concurrent requests efficiently.
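
The multi-worker setup described in the question effectively gives every worker its own copy of the model, which is also why it breaks down on a GPU: each copy claims its own share of GPU memory. A rough sketch of that pattern, with the standard library standing in for uwsgi and the load_model helper assumed to be the same one used in the Flask app:

import multiprocessing as mp

import model  # assumed: same helper as in the Flask app


def worker(jobs, results):
    # Every worker process loads its own model copy; cheap enough on CPU,
    # but on a GPU each copy occupies its own chunk of GPU memory.
    m = model.load_model()
    while True:
        job_id, audio, fs = jobs.get()
        results.put((job_id, m.stt(audio, fs)))


if __name__ == '__main__':
    jobs, results = mp.Queue(), mp.Queue()
    for _ in range(4):  # e.g. one worker per core
        mp.Process(target=worker, args=(jobs, results), daemon=True).start()
    # jobs.put((0, audio, fs)) would enqueue work; results.get() collects transcripts.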

As far as I can tell, I don’t think any sharing is possible. You could try loading multiple models as long as they all fit in your GPU memory, but my experiments with that were not very conclusive.
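
If you do try the multiple-models route, a simple way to keep concurrent requests from colliding is to load a fixed pool of instances once at startup and have each request check one out from a queue. A minimal sketch, assuming the same load_model helper as above; POOL_SIZE is an assumption you would tune to your GPU:

import queue

import model  # assumed: same helper as in the Flask app

POOL_SIZE = 2  # assumption: tune to how many copies actually fit in GPU memory

# Load a fixed number of model instances once at startup and share them.
model_pool = queue.Queue()
for _ in range(POOL_SIZE):
    model_pool.put(model.load_model())


def transcribe(audio, fs):
    # Block until an instance is free, run inference, then hand it back.
    m = model_pool.get()
    try:
        return m.stt(audio, fs)
    finally:
        model_pool.put(m)

The predict handler would then call transcribe(audio, fs) instead of model.stt(audio, fs), and the server needs to accept requests concurrently (e.g. app.run(threaded=True) or a threaded uwsgi setup). With POOL_SIZE set to 1 this degenerates to a single shared model with serialized inference, which is often the safest starting point when GPU memory is tight.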