Deep learning voice STT - deployment issues and advice

Currently working on a custom deep learning voice STT model that takes voice input and converts it to text. It currently covers only the English alphabet and numbers (up to 100). The issue I am seeing is that the model breaks down on deployment, especially when multiple users try to use the STT service concurrently (lag appears at 3-4 users, and the system breaks down completely and comes to a standstill at around 8-10 parallel users).

I am using Node.js as the backend (not ideal for multi-threading) and am hosting the STT service in K8s containers (multiple instances). I have also tried GPU-optimized VMs in the cloud but see the same issue.

The model works flawlessly when testing features locally, but it simply lags on deployment, especially when multiple users test it or when the speech is long (which increases inference time).

After careful analysis, the likely bottleneck seems to be the STT process running on the backend server: it is computationally expensive and ends up blocking the main thread. Multi-threading is an option, but it requires careful research, and I am not even sure whether the npm package we use supports it.
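
To make that option concrete, the pattern I am looking at is roughly a worker pool that keeps inference off the request-handling thread (a sketch in Python for illustration, since that is what most DeepSpeech examples use; `run_stt`, `MODEL_PATH` and the beam width are placeholders, not our actual code):

```python
# Sketch only: keep STT inference out of the request-handling thread/event
# loop by pushing it to a small pool of worker processes.
from concurrent.futures import ProcessPoolExecutor

import numpy as np

MODEL_PATH = "output_graph.pbmm"   # placeholder path
BEAM_WIDTH = 500                   # placeholder value
_model = None                      # one model per worker process, loaded lazily


def _get_model():
    # Load the DeepSpeech model once per worker process (0.6.x-style API,
    # where the constructor takes the model path and beam width).
    global _model
    if _model is None:
        from deepspeech import Model
        _model = Model(MODEL_PATH, BEAM_WIDTH)
    return _model


def run_stt(audio_int16: np.ndarray) -> str:
    # Runs inside a worker process, so long inferences never block the server.
    return _get_model().stt(audio_int16)


# The web layer only submits work and waits on a future; it never runs
# inference itself, and max_workers bounds concurrency explicitly.
executor = ProcessPoolExecutor(max_workers=4)


def handle_request(audio_int16: np.ndarray) -> str:
    return executor.submit(run_stt, audio_int16).result()
```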

Our deeplearning model details are as follows:

lm binary size - 1 MB
lm trie size - 79 KB
output_graph.pbmm size - 180 MB

Kindly let me know if any other details are required.

Questions that I would like to get a handle on:

  1. Having thrown the kitchen sink at this model (even a 112 GB RAM VM with 12 cores and a GPU), it still throws a CUDA memory timeout error with as few as 4 concurrent users. The smell test says it’s not the infrastructure.

  2. Multi-threading with a simple load balancer, so that each core gets a different input when inputs arrive in parallel, has also been hit and miss. Plus, this approach caps the service at only 12 concurrent users despite the extensive resources.

  3. I did try reducing the beam width to get some breathing room on memory, but the gains were marginal at best. What levers can be changed so that this STT service can scale to at least a few hundred users?
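
For reference, this is where the beam-width lever sits as far as I can tell (a rough sketch of the 0.6.x-era Python API; the exact signatures and the lm/trie values are assumptions on my part):

```python
# Rough sketch of the 0.6.x-era Python API; exact signatures differ across
# DeepSpeech releases, so treat the argument lists as an assumption.
from deepspeech import Model

BEAM_WIDTH = 200   # smaller beam -> less compute/memory, lower accuracy
LM_ALPHA = 0.75    # language model weight (tune for your data)
LM_BETA = 1.85     # word insertion bonus (tune for your data)

model = Model("output_graph.pbmm", BEAM_WIDTH)   # 0.6.x: beam width at load time
model.enableDecoderWithLM("lm.binary", "trie", LM_ALPHA, LM_BETA)

# In 0.7+ the same lever is exposed as a setter instead:
#   model = Model("output_graph.pbmm")
#   model.setBeamWidth(BEAM_WIDTH)
```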

Did you mean “CUDA out of memory error”?
The memory-mapped models (.pbmm) are fast but eat a lot of RAM compared to the .pb models. The .pbmm model I am running uses 20 MB of GPU RAM. Firing up multiple DeepSpeech instances will multiply the GPU RAM requirement.
How many GPUs do you have, and how much GPU RAM?


Adding to @dkreutz, search the forum for "multiple" and "concurrent". There was some talk about this in a couple of posts; I remember that one DeepSpeech instance needs one GPU. I would also switch to Python, as I don’t know whether this is a Node problem. I had a Python script using several DeepSpeech instances without memory problems.
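
Something along these lines, one process per GPU (just a sketch from memory, not the exact script and not tested against 0.6.1; paths, beam width and the GPU count are placeholders):

```python
# Sketch: give each worker process its own GPU by restricting CUDA visibility
# before deepspeech/TensorFlow is imported in that process.
import os
from multiprocessing import Process


def worker(gpu_id: int, model_path: str) -> None:
    # Must be set before the first CUDA-using import in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from deepspeech import Model
    model = Model(model_path, 500)   # 0.6.x-style constructor (assumed)
    # ... pull audio buffers from a queue and call model.stt(...) here ...


if __name__ == "__main__":
    procs = [Process(target=worker, args=(gpu, "output_graph.pbmm"))
             for gpu in range(2)]    # e.g. a machine exposing 2 GPUs
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```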

What version do you use? Please give some more info if you need more help.

Thank you for the response. I have been reading and asking around, and one approach I am considering is separating the app from the model into a separate microservice and scaling it on GPU-enabled instances independently.

The Node.js version is 12.x. The DeepSpeech model is a .pbmm, version 0.6.1.
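
To make the microservice idea concrete, the model side of the split would look roughly like this (a hypothetical sketch using Flask; the endpoint, port, audio format and beam width are assumptions, not the final design):

```python
# Sketch of the STT microservice: one DeepSpeech model per replica, requests
# serialized with a lock so GPU memory per replica stays bounded. Throughput
# then comes from running more replicas behind the existing load balancer.
import threading

import numpy as np
from flask import Flask, jsonify, request
from deepspeech import Model

app = Flask(__name__)
model = Model("output_graph.pbmm", 500)   # 0.6.x-style constructor (assumed)
infer_lock = threading.Lock()             # one inference at a time per replica


@app.route("/stt", methods=["POST"])
def stt():
    # Assumes the client posts raw 16 kHz, 16-bit mono PCM in the request body.
    audio = np.frombuffer(request.data, dtype=np.int16)
    with infer_lock:
        text = model.stt(audio)
    return jsonify({"text": text})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Each replica would then be a K8s pod requesting one GPU, and the app would talk to it over HTTP instead of loading the model in-process.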

Yeah, but I rented a pretty decent GPU-backed machine for testing.

I used the NC12, which has considerable GPU resources, but I keep seeing the CUDA out of memory error with concurrent requests. Hardly 6-8 at most could be handled before I saw the error.

Thanks for the update. Once you have a working solution, it would be great if you could post it here or in the linked post.

You are using a 0.6.1 model, which is incompatible with the faster, newer DS versions. It might be cheaper to retrain in the long run.