We are running AWS EC2 instances behind a load balancer to try to run DS at scale. We are using the latest stable release of DS (0.9.3) with the Node.js bindings.
The AWS instances we are using are C5as with the following spec:
c5.large: 2 vCPU, 4 GB RAM
We found that when we ran DS behind a Node Express router, the CPU-heavy DS inference would block the event loop and prevent additional requests from being processed properly.
So now we are forking a child process, and running DS in the child process when a request comes in.
We also have a rudimentary queue system which limits the number of concurrent DS child processes to 2. We have found that if we set this limit higher, the system actually performs worse, not better.
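For illustration, a concurrency cap like that can be implemented roughly as follows (a simplified sketch, not our exact code; the names are illustrative):

```js
// Cap the number of DS child-process jobs running at once.
const MAX_CONCURRENT = 2;
let running = 0;
const pending = [];

// enqueue(job) where job is a function returning a Promise (e.g. one transcription).
function enqueue(job) {
  return new Promise((resolve, reject) => {
    pending.push({ job, resolve, reject });
    drain();
  });
}

function drain() {
  while (running < MAX_CONCURRENT && pending.length > 0) {
    const { job, resolve, reject } = pending.shift();
    running++;
    job()
      .then(resolve, reject)
      .finally(() => { running--; drain(); });
  }
}
```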
So here is my question…
What is the relationship between RAM / number of vCPUs and the number of forked DS child processes that will run smoothly? Does anyone have experience optimizing a stack like this?
I’m not an expert, but make sure that in your setup you are loading the model only once and then using it repeatedly for inference. If you have to load the model for each inference, your performance will be very bad. So those child processes need to keep the model loaded and be long-running, not short-lived.
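For what it’s worth, here is a minimal sketch of that idea with the 0.9.3 Node.js bindings: a long-lived child process that loads the model once at startup and then serves many transcription requests over IPC. The file names, message shape and base64 encoding are illustrative assumptions, not taken from the original setup.

```js
// worker.js — long-lived child process: load the model once, serve many requests.
const DeepSpeech = require('deepspeech');

// Loaded once at startup; the process stays alive between requests.
const model = new DeepSpeech.Model('deepspeech-0.9.3-models.pbmm');
model.enableExternalScorer('deepspeech-0.9.3-models.scorer');

process.on('message', ({ id, audioBase64 }) => {
  // The audio is expected to be 16-bit, 16 kHz, mono PCM.
  const pcm = Buffer.from(audioBase64, 'base64');
  const text = model.stt(pcm); // reuses the already-loaded model
  process.send({ id, text });
});
```

The parent would `fork('./worker.js')` once per worker and `child.send()` jobs to it, instead of forking a fresh process (and reloading the model) for every incoming request.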
I have the same question (and I do not have the answer) regarding the server resources DS needs: RAM/CPU (and disk).
But let me try to clarify a few collateral points:
> We found that when we ran DS behind a Node Express router, the CPU-heavy DS inference would block the event loop and prevent additional requests from being processed properly.
> So now we are forking a child process, and running DS in the child process when a request comes in.
So far I do not know exactly how the DS runtime works under the hood, but monitoring DS CPU usage on my PC, I guess there is multi-threaded processing (the more CPUs you have, the shorter the elapsed time you get).
Now, for designing a Node.js server, I propose some suggestions:
Spawning an external deepspeech process is not the best option in terms of performance. A better option is to use the DS Node.js native client API, wrapping it in an async function that does not block your (Express) server main thread.
BTW, just today I released my DeepSpeechJs micro-project to demonstrate exactly that concept.
As someone suggested before in this thread, it’s a good idea to initialize the model once, avoiding re-initialization for each HTTP request. That’s why I made two entry points:
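Roughly, the two entry points boil down to something like this (a simplified sketch; the real code in DeepSpeechJs may differ, and `loadModel` is just an illustrative name, while `deepSpeechTranscript` is the function discussed below):

```js
const DeepSpeech = require('deepspeech');

// Entry point 1: load the model (and scorer) once, at server startup.
function loadModel(modelPath, scorerPath) {
  const model = new DeepSpeech.Model(modelPath);
  model.enableExternalScorer(scorerPath);
  return model;
}

// Entry point 2: transcribe one audio buffer, reusing the already-loaded model.
// Note: model.stt() is a synchronous native call in the 0.9.x bindings, so for
// real parallelism you may still want child processes or worker threads, as
// discussed above; the async wrapper just keeps the calling code uniform.
async function deepSpeechTranscript(model, audioBuffer) {
  return model.stt(audioBuffer);
}

// Once, at startup:
// const model = loadModel('deepspeech-0.9.3-models.pbmm', 'deepspeech-0.9.3-models.scorer');
// Per HTTP request:
// const text = await deepSpeechTranscript(model, pcmBuffer);
```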
How do you supply the audio content to the DS API at runtime? I see two options:

- (WAV) audio files
- audio buffers
In my function deepSpeechTranscript I assume the input is an audio file, but to get the shortest elapsed times I guess that passing memory buffers saves time.
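A sketch of the two input options, assuming 16 kHz, 16-bit, mono PCM audio (the WAV handling below is deliberately naive; a real implementation should parse the header, e.g. with the node-wav package):

```js
const fs = require('fs');

// Option 1: (WAV) audio file — read from disk, strip the header to get raw PCM.
function pcmFromWavFile(path) {
  const wav = fs.readFileSync(path);
  return wav.slice(44); // naive: assumes a canonical 44-byte WAV header
}

// Option 2: audio buffer — the caller already has raw PCM in memory
// (e.g. received over a socket), so no disk I/O or header parsing is needed.
//
// Either way, the result is fed to the already-loaded model:
//   const textFromFile   = model.stt(pcmFromWavFile('speech.wav'));
//   const textFromBuffer = model.stt(incomingPcmBuffer);
```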
You have to think about “where” the server clients are, considering end-to-end elapsed times. For example, if the speech is an audio message recorded in a web browser, you may want to send the audio as binary data (through WebSockets as suggested, or Socket.IO, which is my preferred solution), possibly converting the audio codec to produce an audio buffer in the format DS needs.
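With Socket.IO (v4 style) the server side could look roughly like this; the event names are made up, and the audio is assumed to arrive already converted to 16 kHz, 16-bit, mono PCM:

```js
const { Server } = require('socket.io');

const io = new Server(3000);

io.on('connection', (socket) => {
  // 'audio-message' / 'transcript' are illustrative event names.
  socket.on('audio-message', async (data) => {
    const pcm = Buffer.isBuffer(data) ? data : Buffer.from(data);
    // model is loaded once at startup; deepSpeechTranscript is the sketch above.
    const text = await deepSpeechTranscript(model, pcm);
    socket.emit('transcript', text);
  });
});
```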
I hope this helps. Your main question still remains: is there any rule of thumb for sizing DS server resources, with or without a GPU?