Scaling DeepSpeech to deal with many concurrent requests

Hey all,

We are currently using DeepSpeech plus KenLM to offer listen-and-repeat and read-aloud activities to English language learners.

The problem we’re having is the time it takes for DeepSpeech to generate a response when there are multiple concurrent requests (say 20 or more students all using the system at the same time).

As our tool gains popularity, we will have even more concurrent requests, possibly even hundreds or thousands at a time.

So we need to ensure that DeepSpeech is returning a transcript result ideally within 3 to 5 seconds.

The speech segments are short: about 15 to 30 seconds each.

And we pre-generate custom scorers with KenLM, so that is not a bottleneck here.

We’d like to know if anyone else has scaled up DeepSpeech to comfortably handle 10s or 100s of concurrent requests.

Importantly, whatever solution we adopt mustn’t be prohibitively expensive.

Here are some questions we are considering:

  • should we have multiple instances of DS behind a load balancer?
  • should we run DS on a Node server or on Nginx with a PHP bridge (using PHP’s exec function to trigger DS recognition)?
  • is there any significant difference in speed between these two solutions?
  • should we run DeepSpeech on a compatible GPU?
  • how much memory / what size CPU is optimal?

Any advice on the above or any other information about how to effectively scale DeepSpeech (on a budget) gratefully received.


Scaling seems to be the topic of the week; this is the third post about it in 24 hours :slight_smile: Why don’t we collect ideas in this post? Check the other two here and here.

@Paul_Raine, @Dsa, @aphorist13 please continue to post here, so we have all ideas on concurrency and scaling in one place.

@utunga I guess you have a bigger installation running, any ideas on high performance? And @lissyx or @reuben, do you have any input on how to scale DeepSpeech?

For starters, we have a couple of smaller virtual CPU servers running that simply get jobs from a self-made balancer. But it is not time critical, so it is OK for us if it takes a couple of minutes.

Summary for future reference:

  • The underlying libdeepspeech.so can be accessed concurrently on CPU via the native bindings. You would have to manage the processes yourself.
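
As a rough sketch of that idea (not an official recipe): a small pool of worker processes, each loading its own Model via the Python bindings once at startup and then reusing it for every job. The model and scorer paths below are placeholders based on paths mentioned later in this thread.

import multiprocessing as mp
import wave

import numpy as np
from deepspeech import Model

MODEL_PATH = "/root/model.pbmm"        # placeholder paths, adjust to your setup
SCORER_PATH = "/root/activity.scorer"  # hypothetical pre-generated scorer

_model = None  # one Model instance per worker process


def init_worker():
    # runs once per worker: load the acoustic model and scorer a single time
    global _model
    _model = Model(MODEL_PATH)
    _model.enableExternalScorer(SCORER_PATH)


def transcribe(wav_path):
    # expects 16 kHz, 16-bit mono WAV, the format DeepSpeech expects
    with wave.open(wav_path, "rb") as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)
    return _model.stt(audio)


if __name__ == "__main__":
    # four workers -> four models in memory, jobs are spread across them
    with mp.Pool(processes=4, initializer=init_worker) as pool:
        print(pool.map(transcribe, ["clip1.wav", "clip2.wav", "clip3.wav"]))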

Since you don’t share any hardware info on your current setup, it’s impossible to help there.

Could you first explain your current design?

there’s no batching support in libdeepspeech.so, so I don’t see how you can run anything concurrently.

Good to know, I guess you could have several DeepSpeech instances on one machine and distribute work between them.

When we invoke DeepSpeech from the bash command line, does that launch a separate instance every time, or does it reuse an instance that is already running?

Are we talking about the C++ binary? NodeJS? Python?
Please read the code and understand that those are just basic frontends to the library / bindings.

Sorry, I don’t even understand what you mean there.

@lissyx Thank you for your response. Here is a little more info about our setup.

  1. teacher inputs sentences they would like learners to practice
  2. we concatenate the sentences into a single text file, and send the text file to kenlm, which returns a scorer
  3. learner accesses the speaking activity via a web browser
  4. we record the audio via the web browser and send it via FormData (along with the pre-generated scorer) to a PHP script
  5. the PHP script converts the audio into a format compatible with DeepSpeech, and then invokes deepspeech with the audio file and the scorer:

exec("ffmpeg -i “.escapeshellarg($_FILES[‘blob’][‘tmp_name’]).” -acodec pcm_s16le -ac 1 -ar 16000 ".$audioPath);
$transcript = shell_exec("sudo /root/DeepSpeech/deepspeech --model /root/model.pbmm --scorer “.$scorerPath.” --audio ".$audioPath);

  6. transcript is returned to the browser for further processing

If you could provide any advice on how to optimize this flow, that would be appreciated.

We are invoking DeepSpeech from the command line using PHP’s shell_exec, but we are also looking into using the Node.js bindings, if they are more efficient…

sorry but this is super bad:

  • NEVER run code as root, unless you wrote it
  • don’t expect good performance from a setup like that; you’re reloading the model every time
  • we don’t have bindings for PHP, sorry

They are not going to be more efficient by themselves; it’s just a binding of libdeepspeech.so, so it will be as fast. The big difference is that you can properly use the API: load the model once and run multiple inferences against it.
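
For example, a minimal sketch with the Python bindings (the model path is the one from your shell_exec call; the scorer path is assumed): the model is loaded once at startup, and every request after that only pays the cost of inference.

import wave

import numpy as np
from deepspeech import Model

# load once at process startup, NOT once per request
model = Model("/root/model.pbmm")
model.enableExternalScorer("/root/activity.scorer")  # hypothetical scorer path


def transcribe(wav_path):
    # 16 kHz, 16-bit mono WAV, i.e. the output of your ffmpeg conversion step
    with wave.open(wav_path, "rb") as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)
    return model.stt(audio)


# every call reuses the already-loaded model
for clip in ["student1.wav", "student2.wav"]:
    print(transcribe(clip))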

And given your use case, you obviously want to use streaming …
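
With the streaming API you can start feeding audio while the learner is still speaking, so most of the decoding is done by the time they finish. A rough sketch with the Python bindings (model/scorer paths assumed, and the two callbacks are hypothetical glue for whatever transport you use):

import numpy as np
from deepspeech import Model

model = Model("/root/model.pbmm")                    # assumed path
model.enableExternalScorer("/root/activity.scorer")  # hypothetical scorer

stream = model.createStream()


def on_audio_chunk(chunk_bytes):
    # feed 16 kHz, 16-bit mono PCM as it arrives from the browser
    stream.feedAudioContent(np.frombuffer(chunk_bytes, np.int16))
    return stream.intermediateDecode()  # optional partial transcript


def on_recording_finished():
    # the final transcript is available almost immediately after the last chunk
    return stream.finishStream()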

Add threading / processes, you load one model on each, and voilà …

This is also super bad for your latency. Please explore https://github.com/mozilla/DeepSpeech-examples/ where there are multiple examples of streaming audio from the browser over a websocket.

this could be done client-side, I think we have some examples doing it

There are quite a number of DeepSpeech server implementations on GitHub: https://github.com/search?q=deepspeech+server

None of the implementations tackles the issue of concurrent requests. Some kind of request queueing is required, plus load-balancing in case you have multiple DS instances available.
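
As an illustration only (the worker URLs and response format are made up): a naive round-robin dispatcher in Python that spreads requests over several DeepSpeech worker instances, each of which keeps its model loaded. A real deployment would add a queue, health checks and retries.

import itertools

import requests  # third-party HTTP client

# hypothetical DeepSpeech workers, e.g. small HTTP wrappers around libdeepspeech
BACKENDS = itertools.cycle([
    "http://10.0.0.11:8080/stt",
    "http://10.0.0.12:8080/stt",
])


def transcribe(wav_bytes):
    # naive round-robin: take the next backend in the cycle for each request
    url = next(BACKENDS)
    resp = requests.post(url, data=wav_bytes,
                         headers={"Content-Type": "audio/wav"}, timeout=10)
    resp.raise_for_status()
    return resp.json()["transcript"]  # response shape is an assumption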

My knowledge of application server stacks is outdated (J2EE or JBoss/Spring were the obvious choices when my job description was “web app developer”), but I am confident that there are now tons of frameworks in almost any programming language that can handle the issue.
