Connection between RAM / number of vCPUs and DS performance when run as a forked Node child process

We are using AWS EC2 instances behind a load balancer to try to run DS at scale. We are using the latest stable release of DS (0.9.3) with the Node.js bindings.

The AWS instances we are using are from the C5 family, with the following spec:

c5.large: 2 vCPUs, 4 GB RAM

We found that when we ran DS directly behind a Node Express router, the CPU-heavy DS inference would block the event loop and prevent additional requests from being processed properly.

So now, when a request comes in, we fork a child process and run DS in that child process.

We also have a rudimentary queue system that limits the number of concurrent DS child processes to 2. We have found that if we set this limit higher, the system actually performs worse, not better.
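
For reference, here is a rough sketch of what this fork-plus-queue arrangement looks like (simplified; the worker script name, the message shape and the route are illustrative, not our exact code):

```js
// server.js — Express enqueues jobs; at most MAX_WORKERS DS children run at once.
const express = require('express');
const { fork } = require('child_process');

const MAX_WORKERS = 2;        // the concurrency limit mentioned above
const queue = [];
let running = 0;

function runNext() {
  if (running >= MAX_WORKERS || queue.length === 0) return;
  const { audioPath, res } = queue.shift();
  running += 1;

  // Fork a fresh DS child for this request and hand it the job over IPC.
  const child = fork('./ds-worker.js');
  child.send({ audioPath });
  child.once('message', ({ transcript }) => {
    res.json({ transcript });
    child.kill();
    running -= 1;
    runNext();                // pull the next queued job, if any
  });
}

const app = express();
app.post('/stt', (req, res) => {
  // Assume the audio has already been written to disk somewhere.
  queue.push({ audioPath: req.query.file, res });
  runNext();
});
app.listen(3000);
```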

So here is my question…
What is the connection between RAM / num vCPUs and the number of DS forked child processes that will run smoothly? Does anyone have experience optimizing a stack like this?

Many thanks in advance

Paul

I’m not an expert, but make sure your setup loads the model only once and then reuses it for inference. If you have to load the model for each inference, performance will be very poor. So those child processes need to keep a loaded model around and be long-running, not short-lived, processes.
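
Something like this is what I mean: a minimal sketch of a long-lived worker that loads the model once at startup and then serves repeated inference requests over IPC (assuming the deepspeech 0.9.x Node bindings; the file paths and message shape are just illustrative):

```js
// ds-worker.js — load the model ONCE at startup, then reuse it for every request.
const fs = require('fs');
const DeepSpeech = require('deepspeech');

const model = new DeepSpeech.Model('deepspeech-0.9.3-models.pbmm');
model.enableExternalScorer('deepspeech-0.9.3-models.scorer');

// Handle inference requests sent by the parent over IPC, reusing the same model.
process.on('message', ({ audioPath }) => {
  // Assumes a 16 kHz, 16-bit mono WAV; slice(44) drops a standard header.
  const pcm = fs.readFileSync(audioPath).slice(44);
  process.send({ transcript: model.stt(pcm) });
});
```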

Hi Mike, thanks for that. Are you talking about the acoustic model or language model? Cheers.

OK, the problem I’m facing here is that only JSON-serializable data can be passed to a forked Node child process, and the model is not JSON data…

Set up a WebSocket; this is how I dealt with this problem in a setup for a Raspberry Pi 4 with the WebThings Gateway.

forks here: https://github.com/lissyx/voice-addon/blob/ds/ds-api-handler.js#L27-L30

then uses a WebSocket to push audio data directly from the mic to the child process.
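
A simplified sketch of the child side of that pattern (not the add-on's actual code; the port and model path are placeholders):

```js
// ds-socket-worker.js — the forked child runs a WebSocket server, feeds the
// incoming binary audio chunks into a DeepSpeech stream, and reports the
// final transcript back to the parent when the client disconnects.
const WebSocket = require('ws');
const DeepSpeech = require('deepspeech');

const model = new DeepSpeech.Model('deepspeech-0.9.3-models.pbmm');
const wss = new WebSocket.Server({ port: 8081 });

wss.on('connection', (socket) => {
  const dsStream = model.createStream();
  socket.on('message', (chunk) => {
    // chunk is expected to be raw 16 kHz, 16-bit mono PCM from the mic
    dsStream.feedAudioContent(chunk);
  });
  socket.on('close', () => {
    process.send({ transcript: dsStream.finishStream() });
  });
});
```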


Hi Paul

I have the same question (and I do not have the answer) regarding the server resources DS needs: RAM, CPU (and disk).

But let me try to clarify a few collateral points:

We found that when we ran DS directly behind a Node Express router, the CPU-heavy DS inference would block the event loop and prevent additional requests from being processed properly.

So now, when a request comes in, we fork a child process and run DS in that child process.

So far I do not know how the DS runtime works under the hood, but monitoring DS CPU usage on my PC, I guess the processing is multi-threaded (the more CPUs you have, the shorter the elapsed time).

Now, regarding the design of a Node.js server, here are some suggestions:

  1. Spawning an external deepspeech process is not the best option in terms of performance. A better option is:

    • to use the DS Node.js native client API,
    • to create an async function that does NOT block your (Express) server’s main thread.

    BTW, just today I published my DeepSpeechJs micro-project to demonstrate exactly that concept.

As someone suggested earlier in this thread, it’s a good idea to initialize the model once, avoiding re-initializing it for each HTTP request. That’s why I made two entry points:

  • deepSpeechInitialize
  • deepSpeechTranscript

See: DeepSpeechJs/deepSpeechTranscriptNative.js at master · solyarisoftware/DeepSpeechJs · GitHub
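
A simplified sketch of that shape (the real code is in the repo linked above; this is just to show the initialize-once / transcribe-many idea, with placeholder file paths):

```js
const fs = require('fs');
const DeepSpeech = require('deepspeech');

let model;   // initialized once, then reused for every transcript

function deepSpeechInitialize(modelPath, scorerPath) {
  model = new DeepSpeech.Model(modelPath);
  if (scorerPath) model.enableExternalScorer(scorerPath);
  return model;
}

async function deepSpeechTranscript(audioFile) {
  // Assumes a 16 kHz, 16-bit mono WAV; slice(44) drops a standard header.
  const pcm = fs.readFileSync(audioFile).slice(44);
  // Note: model.stt() is a CPU-bound native call, so the async wrapper alone
  // does not move the work off the main thread; a child process or worker
  // thread is still needed if the server must stay responsive.
  return model.stt(pcm);
}

// Usage: initialize once at server start, then call per request.
deepSpeechInitialize('deepspeech-0.9.3-models.pbmm', 'deepspeech-0.9.3-models.scorer');
```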

  2. How to supply the audio content to the DS API at runtime?

    I see two options:

    • (WAV) audio files
    • audio buffers

    In my function deepSpeechTranscript I assume the input is an audio file, but to get the shortest elapsed times I guess that passing in-memory buffers saves time.

    You also have to think about “where” the server’s clients are, considering end-to-end elapsed times. For example, if the speech is an audio message recorded in a web browser, you may want to send the audio as binary data (through WebSockets as suggested, or Socket.IO, which is my preferred solution), possibly converting the audio codec to produce an audio buffer in the format DS needs (see the sketch just below).
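
To make the two options concrete, a minimal sketch (assuming 16 kHz, 16-bit mono PCM in both cases; any codec conversion would happen before these functions are called):

```js
const fs = require('fs');

// Option 1: a (WAV) audio file on disk — simple, but adds a disk round trip.
function transcribeFile(model, wavPath) {
  const pcm = fs.readFileSync(wavPath).slice(44);  // drop a standard 44-byte header
  return model.stt(pcm);
}

// Option 2: an in-memory buffer, e.g. chunks collected from Socket.IO binary
// events — no file ever touches the disk.
function transcribeBuffer(model, chunks) {
  return model.stt(Buffer.concat(chunks));
}
```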


I hope this helps. Your main question still remains, though: is there any rule of thumb for how to size DS server resources, with or without a GPU?

giorgio
