We are currently using DeepSpeech plus KenLM to offer "listen and repeat" and "read aloud" type activities to English language learners.
The problem we’re having is the time it takes for DeepSpeech to generate a response when there are multiple concurrent requests (say, 20 or more students all using the system at the same time).
As our tool gains popularity, we will have even more concurrent requests, possibly even hundreds or thousands at a time.
So we need to ensure that DeepSpeech is returning a transcript result ideally within 3 to 5 seconds.
The speech segments are short: about 15 to 30 seconds each.
And we pre-generate custom scorers with KenLM, so that is not a bottleneck here.
We’d like to know if anyone else has scaled up DeepSpeech to comfortably handle tens or hundreds of concurrent requests.
Importantly, whatever solution we adopt mustn’t be prohibitively expensive.
Here are some questions we are considering:
should we have multiple instances of DeepSpeech behind a load balancer?
should we run DeepSpeech on a Node server or on Nginx with a PHP bridge (using PHP’s exec function to trigger DeepSpeech recognition)?
is there any significant difference in speed between these two solutions?
should we run DeepSpeech on a compatible GPU?
how much memory and what size CPU are optimal?
Any advice on the above or any other information about how to effectively scale DeepSpeech (on a budget) gratefully received.
Scaling seems to be the topic of the week; this is the third post about it in 24 hours. Why don’t we collect ideas in this post? Check the other two here and here.
@Paul_Raine, @Dsa, @aphorist13 please continue to post here, so we have all ideas on concurrency and scaling in one place.
@utunga I guess you have a bigger installation running, any ideas on high performance? And @lissyx or @reuben, do you have any input on how to scale DeepSpeech?
For starters, we have a couple of smaller virtual CPU servers running that simply get jobs from a self-made balancer. But it is not time-critical, so it is OK for us if it takes a couple of minutes.
Summary for future reference:
The underlying libdeepspeech.so can be accessed concurrently on CPU via the native bindings. You would have to manage the processes yourself.
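The "one model per process, manage the processes yourself" pattern described above can be sketched as follows. This is a minimal illustration, not the real API: `DummyModel` is a stand-in for a loaded DeepSpeech model (in real code you would construct `deepspeech.Model(...)` once inside each worker), and the queue-based pool is one possible way to manage the processes.

```python
# Sketch of a per-process worker pool. DummyModel stands in for a
# deepspeech.Model loaded once per worker process (hypothetical).
import multiprocessing as mp

class DummyModel:
    """Placeholder for the real model -- loaded ONCE per worker."""
    def stt(self, audio):
        # Real code would run inference on an audio buffer here.
        return "transcript of %s" % audio

def worker(jobs, results):
    model = DummyModel()          # pay the model-load cost once
    while True:
        job = jobs.get()
        if job is None:           # poison pill -> shut down cleanly
            break
        job_id, audio = job
        results.put((job_id, model.stt(audio)))

def run_pool(audio_clips, n_workers=2):
    jobs, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(jobs, results))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for i, clip in enumerate(audio_clips):
        jobs.put((i, clip))
    for _ in procs:
        jobs.put(None)
    # Collect results keyed by job id so ordering is preserved.
    out = dict(results.get() for _ in audio_clips)
    for p in procs:
        p.join()
    return [out[i] for i in range(len(audio_clips))]

if __name__ == "__main__":
    print(run_pool(["a.wav", "b.wav", "c.wav"]))
```

Each worker holds its own model copy in memory, so the number of workers is bounded by RAM; the queue gives you the request buffering discussed later in the thread.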
lissyx ((slow to reply) [NOT PROVIDING SUPPORT]):
Since you don’t share any hardware info on your current setup, it’s impossible to help there.
lissyx:
You should first explain your current design.
lissyx:
There’s no batching support in libdeepspeech.so, so I don’t see how you can run anything concurrently.
When we invoke DeepSpeech from the bash command line, does that launch a separate instance every time, or does it try to launch the same instance that is already running?
lissyx:
Are we talking about the C++ binary? NodeJS? Python?
Please read the code and understand that those are just basic frontends to the library / bindings.
Sorry, I don’t even understand what you mean there.
We are invoking DeepSpeech from the command line using PHP’s shell_exec, but we are also looking into using the Node.js bindings, if they are more efficient…
lissyx:
Sorry, but this is super bad:
NEVER run code as root, unless you wrote it
don’t expect good performance from a setup like that; you’re reloading the model every time
we don’t have bindings for PHP, sorry
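The "reloading the model every time" cost is the key point: each shell_exec call starts a fresh process that loads the whole model before transcribing, while a resident server pays that cost once. A minimal sketch with a simulated load delay (the 0.05 s figure is an arbitrary stand-in; real DeepSpeech model loads can take seconds):

```python
import time

MODEL_LOAD_S = 0.05   # simulated load time; the real cost is much larger

class DummyModel:
    """Stand-in for a DeepSpeech model whose constructor is expensive."""
    def __init__(self):
        time.sleep(MODEL_LOAD_S)   # simulate loading the acoustic model
    def stt(self, audio):
        return "transcript of %s" % audio

def transcribe_exec_style(clips):
    # What shell_exec effectively does: a fresh process
    # (and a full model load) for every single request.
    return [DummyModel().stt(c) for c in clips]

def transcribe_resident(clips):
    # Long-lived server process: pay the load cost once.
    model = DummyModel()
    return [model.stt(c) for c in clips]

clips = ["clip%d.wav" % i for i in range(5)]
t0 = time.perf_counter()
transcribe_exec_style(clips)
t_exec = time.perf_counter() - t0
t0 = time.perf_counter()
transcribe_resident(clips)
t_res = time.perf_counter() - t0
print("exec-style: %.2fs  resident: %.2fs" % (t_exec, t_res))
```

With N requests, exec-style pays N model loads versus one, so the gap grows linearly with traffic.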
lissyx:
They are not going to be more efficient by themselves; it’s just a binding of libdeepspeech.so, so it will be just as fast. The big difference is that you can properly use the API: load the model once and run multiple inferences against it.
And given your use case, you obviously want to use streaming…
Add threading / processes, load one model on each, and voilà…
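The load-once-plus-streaming pattern suggested here looks roughly like the sketch below. The method names mirror the shape of the DeepSpeech Python bindings (`createStream`, `feedAudioContent`, `finishStream`), but `DummyModel`/`DummyStream` are stand-ins so the sketch runs without the library or a model file; treat it as an illustration of the pattern, not the definitive API.

```python
# Load-once + streaming sketch; Dummy* classes are hypothetical
# stand-ins for the real DeepSpeech model and stream objects.
class DummyStream:
    def __init__(self):
        self.chunks = []
    def feedAudioContent(self, chunk):
        # Real bindings take a 16-bit PCM audio buffer here.
        self.chunks.append(chunk)
    def finishStream(self):
        # Real bindings return the final transcript here.
        return "decoded %d chunks" % len(self.chunks)

class DummyModel:
    # Real code would construct the model once per process, e.g.
    # from a .pbmm file plus your pre-generated custom scorer.
    def createStream(self):
        return DummyStream()

model = DummyModel()                      # loaded ONCE per process
stream = model.createStream()             # one stream per utterance
for chunk in (b"\x00\x01", b"\x02\x03"):  # feed audio as it arrives
    stream.feedAudioContent(chunk)
print(stream.finishStream())
```

Streaming matters for the 3–5 second target: decoding proceeds while the student is still speaking, so the final result arrives shortly after the audio ends instead of only starting then.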
lissyx:
None of the implementations tackles the issue of concurrent requests. Some kind of request queueing is required, plus load-balancing in case you have multiple DS instances available.
My knowledge of application server stacks is outdated (J2EE or JBoss/Spring were the obvious choices when my job description was “web app developer”) but I am confident that now there are tons of frameworks in almost any programming language that can handle the issue.
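The queueing-plus-load-balancing idea can be shown in miniature. This is a toy sketch, not a framework recommendation: requests are buffered in a queue and dealt out round-robin to named DeepSpeech instances (the `"ds-1"`/`"ds-2"` names are hypothetical; in practice each would be a worker process or a host behind Nginx).

```python
# Toy round-robin balancer: queue incoming requests, then hand
# them to DS instances in turn. Instance names are hypothetical.
import itertools
from collections import deque

class RoundRobinBalancer:
    def __init__(self, instances):
        self.instances = itertools.cycle(instances)  # endless rotation
        self.queue = deque()                         # request buffer

    def submit(self, request):
        self.queue.append(request)

    def drain(self):
        """Assign every queued request to the next instance in turn."""
        assignments = []
        while self.queue:
            request = self.queue.popleft()
            instance = next(self.instances)          # round-robin pick
            assignments.append((instance, request))
        return assignments

lb = RoundRobinBalancer(["ds-1", "ds-2"])
for r in ["req-a", "req-b", "req-c"]:
    lb.submit(r)
print(lb.drain())
# -> [('ds-1', 'req-a'), ('ds-2', 'req-b'), ('ds-1', 'req-c')]
```

In production you would get this for free from Nginx upstream blocks or any job-queue framework; the point is only that the queue absorbs bursts while the balancer keeps every model-holding worker busy.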