CPU bottleneck when inferencing with GPU?

Hi, I am trying to stand up a DeepSpeech server to transcribe audio files for me, but performance isn’t as good after deploying to my server environment, and I’m thinking it’s a CPU bottleneck. I’m running the server in Python using the v0.6.1 bindings.

My dev machine:
CPU: AMD Ryzen 3700x 8-core
GPU: NVIDIA RTX 2070 Super
Memory: 32 GB
OS: Windows 10
Realtime: 0.334

My server machine:
CPU: AMD FX-8320 8-core
GPU: NVIDIA RTX 2080 Ti
Memory: 32 GB
OS: Ubuntu 19.10
Realtime: 0.526

Can this difference in realtime factor be attributed to the difference in CPUs? Also, are these speeds reasonable (should I expect better)? Happy to provide more information if requested.

Those values are unclear: are you respectively three times and two times faster than realtime? Or three times and two times slower than realtime?

In our definition, realtime < 1.0 means slower.

How can you say this? You are comparing different CPUs, with different memory layouts, different cache speeds, different motherboards, different GPUs, different operating systems and thus different drivers, not to mention the different behavior of different Python builds (which versions?) under those OSes.

That seems like way too many factors to conclude anything.

Sorry about that, I see the confusion. Those values are processing_time / audio_duration, so both machines are faster than realtime.
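To give a concrete idea of what I’m measuring: roughly this, using the 0.6.1 Python bindings as far as I understand them (the model path is a placeholder, not my actual server code):

```python
import time
import wave

import numpy as np
from deepspeech import Model

MODEL_PATH = "deepspeech-0.6.1-models/output_graph.pbmm"  # placeholder path
BEAM_WIDTH = 500

model = Model(MODEL_PATH, BEAM_WIDTH)

def realtime_factor(wav_path):
    # Expects 16 kHz, 16-bit mono WAV, matching the released 0.6.1 model
    with wave.open(wav_path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        audio = np.frombuffer(w.readframes(frames), dtype=np.int16)
    audio_duration = frames / rate

    start = time.perf_counter()
    model.stt(audio)
    processing_time = time.perf_counter() - start

    # With this definition, < 1.0 means faster than realtime
    return processing_time / audio_duration
```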

I agree there are a lot of variables; sadly, I don’t have the server machine on hand, so I can’t do any A/B testing. I’m just trying to get an idea of how CPU-reliant the deepspeech-gpu package is, and what realtime coefficients other people have achieved.

Both systems have similar conda environments, running Python 3.7.0 with deepspeech-gpu 0.6.1. I’ve just noticed my server environment also has the full tensorflow 1.14.0 package installed, so I’ll see if I can create a new one that uses only deepspeech-gpu.
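Something like this should make it easy to confirm what actually ends up installed in the new environment (just a quick sketch using setuptools’ pkg_resources, nothing DeepSpeech-specific):

```python
import pkg_resources

# List anything TensorFlow- or DeepSpeech-related in the active environment
for dist in pkg_resources.working_set:
    name = dist.project_name.lower()
    if "tensorflow" in name or "deepspeech" in name:
        print(dist.project_name, dist.version)
```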

Yeah, but there are so many differences because of the OS.

You’d need to profile your application precisely; it will also depend on your workload.
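Even something as simple as cProfile around your transcription path would tell you whether the time goes into the stt() call itself or into your own audio handling. A rough sketch, where transcribe_file is a placeholder for whatever entry point your server uses:

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
transcribe_file("sample.wav")  # placeholder for your server's transcription path
profiler.disable()

# Show the 20 functions with the highest cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```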

Again, the realtime factor on GPUs depends on how you feed them.

For example, when trying to feed GPUs at max, one contributor reported behavior where the TensorFlow-level tensor code would at some point stall in performance because it was thrashing the CPU caches.
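As a rough illustration of what “how you feed the GPU” means: if the CPU-side work (reading and decoding audio) happens serially between inference calls, the GPU sits idle in between. A minimal sketch that overlaps the two, where load_audio(), model and wav_files are placeholders for your own code:

```python
import queue
import threading

audio_queue = queue.Queue(maxsize=4)  # small buffer of preprocessed audio

def producer(paths):
    # CPU-side work (file I/O, decoding) runs here, overlapping with inference
    for path in paths:
        audio_queue.put((path, load_audio(path)))  # load_audio() is a placeholder
    audio_queue.put(None)  # sentinel: no more work

threading.Thread(target=producer, args=(wav_files,), daemon=True).start()

while True:
    item = audio_queue.get()
    if item is None:
        break
    path, audio = item
    print(path, model.stt(audio))  # the GPU no longer waits on file I/O
```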