CPU bottleneck when inferencing with GPU?

Hi, I am trying to stand up a DeepSpeech server to transcribe audio files for me, but performance isn’t as good after deploying to my server environment, and I’m thinking it’s a CPU bottleneck. I’m running the server in Python using the v0.6.1 bindings.

My dev machine:
CPU: AMD Ryzen 3700x 8-core
GPU: NVIDIA RTX 2070 Super
Memory: 32 GB
OS: Windows 10
Realtime: 0.334

My server machine:
CPU: AMD FX-8320 8-core
GPU: NVIDIA RTX 2080 Ti
Memory: 32 GB
OS: Ubuntu 19.10
Realtime: 0.526

Can this difference in realtime factor be attributed to the difference in CPUs? Also, are these speeds reasonable (should I expect better)? Happy to provide more information if requested.

Those values are unclear: are you respectively three times and two times faster than realtime? Or three times and two times slower than realtime?

In our definition, realtime < 1.0 means slower.

How can you say this? You are comparing different CPUs, with different memory layouts, different cache speeds, different motherboards, different GPUs, different operating systems and thus different drivers, not to mention the different behavior of different Python builds (which versions?) under those OSes.

That seems like way too many factors to conclude anything.

Sorry about that, I see the confusion. Those values are processing_time / audio_duration, so both machines are faster than realtime.
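To give a concrete idea of what I’m measuring: roughly this, using the 0.6.1 Python bindings as far as I understand them (the model path is a placeholder, not my actual server code):

```python
import time
import wave

import numpy as np
from deepspeech import Model

MODEL_PATH = "deepspeech-0.6.1-models/output_graph.pbmm"  # placeholder path
BEAM_WIDTH = 500

model = Model(MODEL_PATH, BEAM_WIDTH)

def realtime_factor(wav_path):
    # Expects 16 kHz, 16-bit mono WAV, matching the released 0.6.1 model
    with wave.open(wav_path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        audio = np.frombuffer(w.readframes(frames), dtype=np.int16)
    audio_duration = frames / rate

    start = time.perf_counter()
    model.stt(audio)
    processing_time = time.perf_counter() - start

    # With this definition, < 1.0 means faster than realtime
    return processing_time / audio_duration
```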

I agree there are a lot of variables; sadly, I don’t have the server machine on hand, so I can’t do any A/B testing. I’m just trying to get an idea of how CPU-reliant the deepspeech-gpu package is, and what realtime coefficients other people have achieved.

Both systems have similar conda environments, running Python 3.7.0 with deepspeech-gpu 0.6.1. I’ve just noticed my server environment also has the full tensorflow 1.14.0 package installed, so I’ll see if I can create a new one that uses only deepspeech-gpu.
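Something like this should make it easy to confirm what actually ends up installed in the new environment (just a quick sketch using setuptools’ pkg_resources, nothing DeepSpeech-specific):

```python
import pkg_resources

# List anything TensorFlow- or DeepSpeech-related in the active environment
for dist in pkg_resources.working_set:
    name = dist.project_name.lower()
    if "tensorflow" in name or "deepspeech" in name:
        print(dist.project_name, dist.version)
```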

Yeah, but there are so many differences because of the OS.

You’d need to profile your application precisely; it will also depend on your workload.
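Even something as simple as cProfile around your transcription path would tell you whether the time goes into the stt() call itself or into your own audio handling. A rough sketch, where transcribe_file is a placeholder for whatever entry point your server uses:

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
transcribe_file("sample.wav")  # placeholder for your server's transcription path
profiler.disable()

# Show the 20 functions with the highest cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```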

Again, the realtime factor on GPUs depends on how you feed them.

For example, when trying to feed GPUs at max, one contributor reported behavior where the TensorFlow-level tensor code would at some point stall in performance because it was thrashing the CPU caches.
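As a rough illustration of what “how you feed the GPU” means: if the CPU-side work (reading and decoding audio) happens serially between inference calls, the GPU sits idle in between. A minimal sketch that overlaps the two, where load_audio(), model and wav_files are placeholders for your own code:

```python
import queue
import threading

audio_queue = queue.Queue(maxsize=4)  # small buffer of preprocessed audio

def producer(paths):
    # CPU-side work (file I/O, decoding) runs here, overlapping with inference
    for path in paths:
        audio_queue.put((path, load_audio(path)))  # load_audio() is a placeholder
    audio_queue.put(None)  # sentinel: no more work

threading.Thread(target=producer, args=(wav_files,), daemon=True).start()

while True:
    item = audio_queue.get()
    if item is None:
        break
    path, audio = item
    print(path, model.stt(audio))  # the GPU no longer waits on file I/O
```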