Flask App Code
import io
import wave
import numpy as np
from flask import Flask, request, jsonify
from timeit import default_timer as timer
from deepspeech import Model

app = Flask(__name__)

# Load the model once at startup. 'path/to/output_graph.pbmm' is a placeholder
# for the trained model file; constructor arguments vary across DeepSpeech versions.
model = Model('path/to/output_graph.pbmm')

@app.route('/api', methods=['POST'])
def predict():
    # Get the WAV data from the POST request body (buffered so wave can seek).
    fin = wave.open(io.BytesIO(request.get_data()))
    fs = fin.getframerate()
    audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
    audio_length = fin.getnframes() / fs  # clip duration in seconds (currently unused)
    fin.close()

    inference_start = timer()
    # Note: DeepSpeech >= 0.7 drops the sample-rate argument: model.stt(audio)
    solution = model.stt(audio, fs)
    inference_end = timer() - inference_start

    resp = {'solution': solution.strip(), 'error': False, 'time': inference_end}
    return jsonify(resp)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5001)
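
For reference, this is roughly how I exercise the endpoint while testing (assuming the server is running on localhost:5001; 'audio.wav' is a placeholder for a 16 kHz mono WAV file):

import requests

# POST the raw WAV bytes to the /api route and print the JSON response.
with open('audio.wav', 'rb') as f:
    resp = requests.post('http://localhost:5001/api', data=f)
print(resp.json())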
I'm trying to deploy a trained DeepSpeech model with Flask. For the CPU version of DeepSpeech I can use uWSGI/nginx to spawn multiple workers that run inference in parallel. For the GPU version, however, the model is too large to load concurrently in multiple workers. As I am not very experienced with ML deployment, I was hoping someone could explain how to deploy a GPU DeepSpeech model so that it can handle multiple concurrent requests efficiently.
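
To make the question concrete, here is a minimal sketch of the naive single-worker approach I want to improve on: one model instance shared behind a lock (the model path is a placeholder, and the Model/stt signatures vary by DeepSpeech version). This serializes all inference on the GPU, which is exactly the bottleneck I would like to handle more efficiently:

import threading
import io
import wave
import numpy as np
from flask import Flask, request, jsonify
from deepspeech import Model

app = Flask(__name__)
model = Model('path/to/output_graph.pbmm')  # placeholder path
gpu_lock = threading.Lock()  # one model instance, so serialize GPU access

@app.route('/api', methods=['POST'])
def predict():
    fin = wave.open(io.BytesIO(request.get_data()))
    fs = fin.getframerate()
    audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
    fin.close()
    # Concurrent requests block here and run inference one at a time.
    with gpu_lock:
        solution = model.stt(audio, fs)
    return jsonify({'solution': solution.strip(), 'error': False})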