Running multiple inferences in parallel on a single machine?

If I run taskset -c 0,1 deepspeech model audio, it takes 27 seconds.

If I run taskset -c 0,1 deepspeech model audio and taskset -c 2,3 deepspeech model audio in 2 separate shell windows, both inference times become 50 seconds for the same audio. Why?
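For reference, this is roughly equivalent to what I am doing (a minimal Python sketch; "model" and "audio" are placeholders for the actual graph and WAV paths):

```python
# Rough Python equivalent of the experiment above: launch two CPU-pinned
# deepspeech runs in parallel and time each one. "model" and "audio" are
# placeholders for the actual graph and WAV paths.
import subprocess
import time

CMDS = [
    ["taskset", "-c", "0,1", "deepspeech", "model", "audio"],
    ["taskset", "-c", "2,3", "deepspeech", "model", "audio"],
]

procs = [(subprocess.Popen(cmd), time.time()) for cmd in CMDS]
for proc, started in procs:
    proc.wait()
    print(f"{' '.join(proc.args)} took {time.time() - started:.1f}s")
```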

RAM is not an issue, as I always have more than 50% of RAM available. The OS is Ubuntu 16.04.

Why does the time increase even when running on mutually exclusive CPUs?

Thanks

What’s your exact hardware?

Using the CPU version of DeepSpeech.

Hardware details: (lscpu)

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 142
Model name: Intel® Core™ i5-7200U CPU @ 2.50GHz
Stepping: 9
CPU MHz: 498.550
CPU max MHz: 3100.0000
CPU min MHz: 400.0000
BogoMIPS: 5423.88
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp

Your CPU has two cores, capable of running 4 threads: https://ark.intel.com/products/95443/Intel-Core-i5-7200U-Processor-3M-Cache-up-to-3_10-GHz

But it’s possible that those are not able to cope with the load. Also, with one core running it will be boosted to 3.10 GHz, while with both cores running they will be capped at 2.50 GHz.

It could also be I/O that is the limiting factor?

Maybe the bottleneck is I/O? The graph file size is substantial. You could try modifying the clients to act more like a server, loading the graph and then waiting for several inputs. That’d reduce the impact of I/O on each run.
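Just a sketch of the idea (load_model() and transcribe() below are hypothetical stand-ins, not the actual DeepSpeech API):

```python
# Load-once / infer-many sketch. load_model() and transcribe() are
# hypothetical placeholders for whatever binding you use; the point is
# only that the graph is read from disk a single time per process.
import sys

def load_model(graph_path):
    # hypothetical: load the frozen graph once and keep it in memory
    raise NotImplementedError

def transcribe(model, wav_path):
    # hypothetical: run inference on one audio file with the loaded graph
    raise NotImplementedError

def main(graph_path):
    model = load_model(graph_path)   # pay the graph-loading I/O cost once
    for line in sys.stdin:           # then serve many requests
        wav_path = line.strip()
        if wav_path:
            print(transcribe(model, wav_path))

if __name__ == "__main__":
    main(sys.argv[1])
```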

The deepspeech binary prints the inference time like below:

Inference took 26.811s for 21.454s audio file.

The model loading time, as seen in the output, is as below:
Loaded model in 1.793s.

So, I am reporting the inference time, which keeps increasing when multiple runs execute in parallel.

We tried printing the execution time.
The execution time for the code below in deepspeech.cc increases as the number of parallel runs (using taskset on mutually exclusive CPUs) increases.
mPriv->session->Run(
    {{"input_node", input}, {"input_lengths", n_frames}},
    {"logits"}, {}, &outputs);

Can you check the frequency of your CPU as well?

lscpu | grep MHz
CPU MHz: 499.710
CPU max MHz: 3100.0000
CPU min MHz: 400.0000

Sorry, I should have been more precise: the frequency during inference with one process and with two processes.

One process: (frequency snapshot)

Two processes: (frequency snapshot)

That’s just a snapshot; one should look at the behavior during the whole process execution. Also, it could be the scheduler / cgroups sharing CPUs in some way?
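For example, a small Linux-only sketch like this would log the per-core frequency every second over the whole run instead of a single lscpu snapshot:

```python
# Samples the per-core "cpu MHz" lines from /proc/cpuinfo once per second,
# so you can watch the frequency for the whole inference run instead of
# taking a single lscpu snapshot. Linux-only, stop with Ctrl-C.
import time

def sample_mhz():
    with open("/proc/cpuinfo") as f:
        return [float(line.split(":")[1]) for line in f if line.startswith("cpu MHz")]

while True:
    print(" ".join(f"{mhz:7.1f}" for mhz in sample_mhz()))
    time.sleep(1)
```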

htop shows the respective CPUs being used at 100%, and the remaining ones at ~0-1%. I have tried on a Google Cloud instance as well, same behaviour…

But htop does not give you the frequency. And remember your 4 cores are HT cores; you still have only 2 physical cores.

OK. But if I run sysbench --test=cpu --cpu-max-prime=20000 run, I get the same time with multiple processes running simultaneously on mutually exclusive CPUs. Will do more research and get back.

@lissyx Would it be possible for you to try this on your system and verify?

I do have a similar slowdown (11 secs => 18 secs), but is it that surprising?

We are trying to figure out how it can be scaled in production. Unless we can do parallel processing, it looks like the cost is going to be too high. If it can do parallel processing efficiently, then inference time can be reduced (by splitting the audio), hence a cheaper and faster value proposition… See the sketch below for the kind of splitting we have in mind.
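This is only a rough sketch of the idea; the paths are placeholders, and a real implementation would split on silence rather than at fixed offsets:

```python
# Rough sketch of the "split the audio and run chunks in parallel" idea.
# Splits a WAV into fixed-length chunks and pins one deepspeech process
# per chunk to its own pair of CPUs. "model" and "audio.wav" are
# placeholders; splitting blindly mid-word would hurt accuracy.
import subprocess
import wave

CHUNK_SECONDS = 10
CPU_SETS = ["0,1", "2,3"]  # one process per CPU pair

def split_wav(path):
    chunks = []
    with wave.open(path, "rb") as src:
        frames_per_chunk = src.getframerate() * CHUNK_SECONDS
        idx = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"chunk_{idx}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(src.getparams())
                dst.writeframes(frames)
            chunks.append(out_path)
            idx += 1
    return chunks

chunks = split_wav("audio.wav")
procs = [
    subprocess.Popen(["taskset", "-c", CPU_SETS[i % len(CPU_SETS)],
                      "deepspeech", "model", chunk])
    for i, chunk in enumerate(chunks)
]
for p in procs:
    p.wait()
```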

There would be a lot of other changes to perform if you want to use that in production. Remember it’s still all an alpha release. And yes, the model consumes a lot of power, that’s granted. Using GPUs would probably help a lot, but you would still need to design that a bit better, probably using some server bits. Some people contributed that (Python, NodeJS): “deepspeech-server”.

What kind of production use case are you targeting? What constraints are there?

We tried the server approach using Python Flask, but still got the same longer inference times. We may be missing something there; we will give it another shot and get back on this. Basically, with the current inference time and the amount of resources the model uses, it is impossible to do real-time inference on a single machine with multiple users. The number of machines will limit how many concurrent inferences can be done at a time.

Yes, we know all of that; that’s why it’s alpha :slight_smile: