I am currently working on a custom deep-learning speech-to-text (STT) model that takes voice input and converts it to text. It currently recognizes only the English alphabet and numbers (up to 100). The issue I am seeing is that the model breaks down on deployment, especially when multiple users try to use the STT service concurrently: lag appears at 3-4 users, and at around 8-10 parallel users the system breaks down completely and comes to a standstill.
I am using Node.js as the backend (not ideal for multi-threading) and am hosting the STT service on K8s containers (multiple instances). I have also tried GPU-optimized VMs in the cloud but see the same issue.
The model works flawlessly when I test features locally, but it simply lags on deployment, especially when multiple users test it or when the speech is long (which increases inference time).
After careful analysis, the likely bottleneck seems to be the STT process running on the backend server: inference is computationally expensive and blocks Node's main thread. Multi-threading is an option, but it requires careful research, and I am not even sure whether the npm package I am using supports it.
Our deep-learning model details are as follows:
- lm binary size: 1 MB
- lm trie size: 79 KB
- output_graph.pbmm size: 180 MB
Kindly let me know if any other details are required.
Questions that I would like to get a handle on:
- Having thrown the kitchen sink at this model (even a 112 GB RAM VM with 12 cores and a GPU), it throws a CUDA memory timeout error with as few as 4 concurrent users. The smell test says it's not the infrastructure.
- I have tried multi-threading with a simple load balancer so that parallel inputs are routed to different cores, but results are hit-or-miss. Worse, this approach caps the service at 12 concurrent users despite the extensive resources.
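One idea I have been sketching for the "only 12 concurrent users" cap: instead of rejecting requests once all cores are busy, queue the overflow in-process so extra users wait briefly rather than crash the service. A minimal sketch (the `InferenceQueue` name and usage are illustrative, not from our codebase):

```javascript
// Sketch: a tiny in-process semaphore that caps concurrent inferences
// at the worker-pool size and queues the rest, instead of letting an
// unbounded number of decodes run at once.
class InferenceQueue {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.active = 0;
    this.waiting = []; // resolvers for queued requests, FIFO
  }

  async run(task) {
    if (this.active >= this.maxConcurrent) {
      // All slots busy: park this request until a slot is handed over.
      await new Promise((resolve) => this.waiting.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) next(); // pass our slot directly to the next waiter
      else this.active--;
    }
  }
}

// Usage sketch (illustrative Express-style handler):
// const queue = new InferenceQueue(numWorkers);
// app.post('/stt', async (req, res) => {
//   const text = await queue.run(() => sttInference(req.body.audio));
//   res.json({ text });
// });
```

This does not raise throughput by itself, but it turns "hard failure above 12 users" into "bounded queueing delay", which seems like a prerequisite for scaling further.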
- I did try reducing the beam width to get some breathing room on memory, but the gains were marginal at best. What levers can be changed so that this STT service can scale to at least a few hundred users?
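For context on the deployment side: the pods currently run a fixed replica count. One lever I am evaluating is horizontal scaling, i.e. letting K8s add STT pods under load via a Horizontal Pod Autoscaler, roughly like the sketch below (resource names such as `stt-service` are placeholders, `autoscaling/v2` API). Would scaling out like this even help if each request still blocks the Node main thread inside a pod?

```yaml
# Sketch: autoscale the STT deployment on CPU utilization.
# All names and thresholds here are illustrative placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: stt-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: stt-service
  minReplicas: 3
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```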