If I run taskset -c 0,1 deepspeech model audio, it takes 27 seconds.
If I run taskset -c 0,1 deepspeech model audio and taskset -c 2,3 deepspeech model audio in two separate shell windows, the inference time for both becomes 50 seconds for the same audio. Why?
RAM is not an issue, as I always have more than 50% of RAM free. The OS is Ubuntu 16.04.
Why does the time increase even when the runs are on mutually exclusive CPUs?
But it’s possible that those cores are not able to cope with the load. Also, with a single core busy it will boost to 3.10 GHz, while with two cores busy they are capped at 2.5 GHz.
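To check that hypothesis, one could watch the per-core clocks while both processes are running. A minimal sketch, assuming the cpufreq sysfs interface is exposed (the same values can also be read directly from those files):

```cpp
// Sketch: print the current clock of each core involved in the two runs, to
// check the boost-vs-all-core-cap hypothesis while both processes are busy.
#include <fstream>
#include <iostream>
#include <string>

int main() {
  for (int cpu : {0, 1, 2, 3}) {
    std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                    "/cpufreq/scaling_cur_freq");
    std::string khz;
    if (f >> khz) {
      std::cout << "cpu" << cpu << ": " << khz << " kHz" << std::endl;
    }
  }
  return 0;
}
```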
Maybe the bottleneck is IO? The graph file size is substantial. You could try modifying the clients to act more like a server, loading the graph and then waiting for several inputs. That’d reduce the impact of IO on each run.
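A minimal sketch of that server-style idea, reusing the TensorFlow C++ calls and the node names ("input_node", "input_lengths", "logits") from the deepspeech.cc snippet quoted later in the thread; feature extraction for the input tensor is omitted and error handling is reduced to TF_CHECK_OK:

```cpp
// Sketch: pay the graph-loading cost once, then reuse the same session for
// many inputs. Building the `input` tensor from audio is omitted; it would
// follow the existing client code.
#include <string>
#include <vector>

#include "tensorflow/core/framework/graph.pb.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/public/session.h"

tensorflow::Session* LoadModelOnce(const std::string& graph_path) {
  tensorflow::GraphDef graph_def;
  TF_CHECK_OK(tensorflow::ReadBinaryProto(tensorflow::Env::Default(),
                                          graph_path, &graph_def));
  tensorflow::Session* session = nullptr;
  TF_CHECK_OK(tensorflow::NewSession(tensorflow::SessionOptions(), &session));
  TF_CHECK_OK(session->Create(graph_def));
  return session;
}

// Called once per request: only Run() is paid each time, not graph loading.
std::vector<tensorflow::Tensor> Infer(tensorflow::Session* session,
                                      const tensorflow::Tensor& input,
                                      const tensorflow::Tensor& n_frames) {
  std::vector<tensorflow::Tensor> outputs;
  TF_CHECK_OK(session->Run({{"input_node", input}, {"input_lengths", n_frames}},
                           {"logits"}, {}, &outputs));
  return outputs;
}
```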
The deepspeech binary prints the inference time like this:
Inference took 26.811s for 21.454s audio file.
Model loading is reported separately in the output: Loaded model in 1.793s.
So the time I am reporting is the inference time, and it keeps increasing as more runs happen in parallel.
We tried printing the execution time.
The execution time of the code below in deepspeech.cc increases as the number of parallel runs (on mutually exclusive CPUs via taskset) increases.
mPriv->session->Run(
    {{"input_node", input}, {"input_lengths", n_frames}},
    {"logits"}, {}, &outputs);
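For reference, a minimal way to time just that call in place (a sketch assuming a C++11 toolchain, not necessarily the exact instrumentation used above):

```cpp
// At the top of deepspeech.cc:
#include <chrono>
#include <iostream>

// Around the existing call:
auto start = std::chrono::steady_clock::now();
tensorflow::Status status = mPriv->session->Run(
    {{"input_node", input}, {"input_lengths", n_frames}},
    {"logits"}, {}, &outputs);
double elapsed = std::chrono::duration<double>(
    std::chrono::steady_clock::now() - start).count();
std::cerr << "session->Run took " << elapsed << " s ("
          << status.ToString() << ")" << std::endl;
```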
lissyx:
That’s just a snapshot; one should look at the behavior during the whole process execution. Also, could it be the scheduler / cgroups sharing CPUs in some way?
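One cheap way to rule out the scheduler/cgroups angle is to have each process print the CPU set it is actually allowed to run on. A minimal Linux-only sketch using sched_getaffinity:

```cpp
// Sketch: print the CPUs this process is actually allowed to run on, to
// confirm that the taskset mask (and any cgroup cpuset) is what we expect.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <unistd.h>

#include <cstdio>

int main() {
  cpu_set_t set;
  CPU_ZERO(&set);
  if (sched_getaffinity(0, sizeof(set), &set) != 0) {
    std::perror("sched_getaffinity");
    return 1;
  }
  std::printf("pid %d allowed CPUs:", static_cast<int>(getpid()));
  for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
    if (CPU_ISSET(cpu, &set)) std::printf(" %d", cpu);
  }
  std::printf("\n");
  return 0;
}
```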
OK. But if I run sysbench --test=cpu --cpu-max-prime=20000 run, I get the same time for multiple processes running simultaneously on mutually exclusive CPUs. I will do more research and get back.
We are trying to figure out how this can be scaled in production. Unless we can do parallel processing, it looks like the cost is going to be too high. If it could do parallel processing efficiently, the inference time could be reduced (by splitting the audio), making for a cheaper and faster value proposition.
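A sketch of that splitting idea, where TranscribeChunk is a purely hypothetical wrapper around whatever inference entry point is used; it only pays off if the cores are not already saturated, which is exactly what the measurements above put in doubt:

```cpp
// Sketch of the audio-splitting idea: transcribe chunks in parallel and join
// the partial transcripts. `TranscribeChunk` is a hypothetical wrapper around
// the inference entry point; chunk boundaries would need care, since cutting
// mid-word hurts accuracy.
#include <future>
#include <string>
#include <vector>

std::string TranscribeChunk(const std::vector<short>& samples);  // hypothetical

std::string TranscribeParallel(const std::vector<std::vector<short>>& chunks) {
  std::vector<std::future<std::string>> parts;
  for (const auto& chunk : chunks) {
    parts.push_back(std::async(std::launch::async, TranscribeChunk, chunk));
  }
  std::string transcript;
  for (auto& part : parts) {
    transcript += part.get();  // preserves chunk order
  }
  return transcript;
}
```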
lissyx:
There would be a lot of other changes to make if you want to use that in production. Remember, it’s all still an alpha release. And yes, the model consumes a lot of power, that’s granted. Using GPUs would probably help a lot, but you would still need to design it a bit better, probably using some server bits. Some people have contributed that (Python, NodeJS): "deepspeech-server".
What kind of production use case are you targeting? What constraints are there?
We tried the server approach using Python Flask, but still got the same longer inference times. We may be missing something there; we will give it another shot and get back on this. Basically, with the current inference time and the amount of resources the model uses, it is impossible to do real-time inference on a single machine with multiple users. The number of machines will be the limit on how many concurrent inferences can be done at a time.
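One thing worth checking in a single-process server setup is TensorFlow’s own thread pools: by default a session sizes them to all visible cores, so several concurrent Run() calls can oversubscribe the machine. A sketch of capping them through SessionOptions, assuming the session is created via the TensorFlow C++ API as in deepspeech.cc:

```cpp
// Sketch: cap TensorFlow's per-session thread pools so a single session does
// not try to use every core when several requests run concurrently.
#include "tensorflow/core/public/session.h"

tensorflow::Session* CreateBoundedSession() {
  tensorflow::SessionOptions options;
  // Threads used inside a single op (e.g. one large matmul).
  options.config.set_intra_op_parallelism_threads(2);
  // Threads used to run independent ops concurrently.
  options.config.set_inter_op_parallelism_threads(1);
  tensorflow::Session* session = nullptr;
  TF_CHECK_OK(tensorflow::NewSession(options, &session));
  return session;
}
```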