Running multiple inferences in parallel on a single machine?

If I run taskset -c 0,1 deepspeech model audio, it takes 27 seconds.

If I run taskset -c 0,1 deepspeech model audio and taskset -c 2,3 deepspeech model audio in 2 separate shell windows, both inference times become 50 seconds for the same audio. Why?
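For reference, this is roughly equivalent to what I am doing (a minimal Python sketch; "model" and "audio" are placeholders for the actual graph and WAV paths):

```python
# Rough Python equivalent of the experiment above: launch two CPU-pinned
# deepspeech runs in parallel and time each one. "model" and "audio" are
# placeholders for the actual graph and WAV paths.
import subprocess
import time

CMDS = [
    ["taskset", "-c", "0,1", "deepspeech", "model", "audio"],
    ["taskset", "-c", "2,3", "deepspeech", "model", "audio"],
]

procs = [(subprocess.Popen(cmd), time.time()) for cmd in CMDS]
for proc, started in procs:
    proc.wait()
    print(f"{' '.join(proc.args)} took {time.time() - started:.1f}s")
```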

RAM is not an issue, as I always have more than 50% of RAM available. The OS is Ubuntu 16.04.

Why does the time increase even when running on mutually exclusive CPUs?

Thanks

What’s your exact hardware?

Using the CPU version of DeepSpeech.

Hardware details: (lscpu)

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 142
Model name: Intel® Core™ i5-7200U CPU @ 2.50GHz
Stepping: 9
CPU MHz: 498.550
CPU max MHz: 3100.0000
CPU min MHz: 400.0000
BogoMIPS: 5423.88
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp

Your CPU has two cores, capable of running 4 threads: https://ark.intel.com/products/95443/Intel-Core-i5-7200U-Processor-3M-Cache-up-to-3_10-GHz

But it’s possible that those are not able to cope with the load. Also, with one core running it will be boosted to 3.10 GHz, while with both cores running they will be capped at 2.50 GHz.

It could also be I/O that is the limiting factor?

Maybe the bottleneck is I/O? The graph file size is substantial. You could try modifying the clients to act more like a server, loading the graph and then waiting for several inputs. That’d reduce the impact of I/O on each run.
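Just a sketch of the idea (load_model() and transcribe() below are hypothetical stand-ins, not the actual DeepSpeech API):

```python
# Load-once / infer-many sketch. load_model() and transcribe() are
# hypothetical placeholders for whatever binding you use; the point is
# only that the graph is read from disk a single time per process.
import sys

def load_model(graph_path):
    # hypothetical: load the frozen graph once and keep it in memory
    raise NotImplementedError

def transcribe(model, wav_path):
    # hypothetical: run inference on one audio file with the loaded graph
    raise NotImplementedError

def main(graph_path):
    model = load_model(graph_path)   # pay the graph-loading I/O cost once
    for line in sys.stdin:           # then serve many requests
        wav_path = line.strip()
        if wav_path:
            print(transcribe(model, wav_path))

if __name__ == "__main__":
    main(sys.argv[1])
```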

The deepspeech binary prints the inference time like below:

Inference took 26.811s for 21.454s audio file.

The model loading time, as seen in the output, is as below:
Loaded model in 1.793s.

So, I am reporting the inference time, which keeps increasing when multiple runs execute in parallel.

We tried printing the execution time.
The execution time for the code below in deepspeech.cc increases as the number of parallel runs (using taskset on mutually exclusive CPUs) increases.
mPriv->session->Run(
    {{"input_node", input}, {"input_lengths", n_frames}},
    {"logits"}, {}, &outputs);

Can you check the frequency of your CPU as well?

lscpu | grep MHz
CPU MHz: 499.710
CPU max MHz: 3100.0000
CPU min MHz: 400.0000

Sorry, I should have been more precise: the frequency during inference with one process and with two processes.

One process: (frequency snapshot)

Two processes: (frequency snapshot)

That’s just a snapshot; one should look at the behavior during the whole process execution. Also, it could be the scheduler / cgroups sharing CPUs in some way?
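For example, a small Linux-only sketch like this would log the per-core frequency every second over the whole run instead of a single lscpu snapshot:

```python
# Samples the per-core "cpu MHz" lines from /proc/cpuinfo once per second,
# so you can watch the frequency for the whole inference run instead of
# taking a single lscpu snapshot. Linux-only, stop with Ctrl-C.
import time

def sample_mhz():
    with open("/proc/cpuinfo") as f:
        return [float(line.split(":")[1]) for line in f if line.startswith("cpu MHz")]

while True:
    print(" ".join(f"{mhz:7.1f}" for mhz in sample_mhz()))
    time.sleep(1)
```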

htop shows the respective CPUs being used at 100%, and the remaining ones at ~0-1%. I have tried on a Google Cloud instance as well, same behaviour…

But htop does not give you the frequency. And remember your 4 cores are HT cores; you still have only 2 physical cores.

OK. But if I run sysbench --test=cpu --cpu-max-prime=20000 run, I get the same time with multiple processes running simultaneously on mutually exclusive CPUs. Will do more research and get back.

@lissyx Would it be possible for you to try this on your system and verify?

I do have a similar slowdown (11 secs => 18 secs), but is it that surprising?

We are trying to figure out how it can be scaled in production. Unless we can do parallel processing, it looks like the cost is going to be too high. If it can do parallel processing efficiently, then inference time can be reduced (by splitting the audio), hence a cheaper and faster value proposition… See the sketch below for the kind of splitting we have in mind.
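This is only a rough sketch of the idea; the paths are placeholders, and a real implementation would split on silence rather than at fixed offsets:

```python
# Rough sketch of the "split the audio and run chunks in parallel" idea.
# Splits a WAV into fixed-length chunks and pins one deepspeech process
# per chunk to its own pair of CPUs. "model" and "audio.wav" are
# placeholders; splitting blindly mid-word would hurt accuracy.
import subprocess
import wave

CHUNK_SECONDS = 10
CPU_SETS = ["0,1", "2,3"]  # one process per CPU pair

def split_wav(path):
    chunks = []
    with wave.open(path, "rb") as src:
        frames_per_chunk = src.getframerate() * CHUNK_SECONDS
        idx = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"chunk_{idx}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(src.getparams())
                dst.writeframes(frames)
            chunks.append(out_path)
            idx += 1
    return chunks

chunks = split_wav("audio.wav")
procs = [
    subprocess.Popen(["taskset", "-c", CPU_SETS[i % len(CPU_SETS)],
                      "deepspeech", "model", chunk])
    for i, chunk in enumerate(chunks)
]
for p in procs:
    p.wait()
```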

There would be a lot of other changes to perform if you want to use that in production. Remember it’s still all an alpha release. And yes, the model consumes a lot of power, that’s granted. Using GPUs would probably help a lot, but you would still need to design that a bit better, probably using some server bits. Some people contributed that (Python, NodeJS): “deepspeech-server”.

What kind of production use case are you targeting? What constraints are there?

We tried the server approach using Python Flask, but still got the same longer inference times. We may be missing something there; we will give it another shot and get back on this. Basically, with the current inference time and the amount of resources the model uses, it is impossible to do real-time inference on a single machine with multiple users. The number of machines will limit how many concurrent inferences can be done at a time.

Yes, we know all of that; that’s why it’s alpha :slight_smile: