Dual CPU socket inference

Hello,

I have a dual CPU socket server, and I have noticed that inference takes a long time (I use the GPU for training and for the transcriptions). Could the dual CPU socket be affecting this? I haven’t found much information about it.

I have a Quadro RTX 6000. It takes about 1.1 seconds per second of audio. I have checked the GPU load while running inference and yes, it is in use.

I have also tried with an RTX 2070 on a single-socket machine, and the inference times are much lower.

Do you have any information about this? Thank you.

I don’t know the specs of your CPU but that is a very old GPU which supports a very old version of CUDA. So the GPU is probably a bigger culprit here.

Without knowing the exact model of your CPU it’s hard to tell. TensorFlow should be able to leverage multi-core, though.

Quadro RTX 6000? Are you sure about that? Because there was also a plain Quadro 6000 produced.

And they don’t really have the same performance: https://gpu.userbenchmark.com/Compare/Nvidia-Quadro-RTX-6000-vs-Nvidia-Quadro-6000/m736712vsm7657

Based on your real-time factor of 1.1, it looks more like a Quadro 6000 than an RTX. Your test with the RTX 2070 would seem to confirm that.

CPU Specifications:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel® Xeon® Bronze 3106 CPU @ 1.70GHz
Stepping: 4
CPU MHz: 800.062
CPU max MHz: 1700.0000
CPU min MHz: 800.0000
BogoMIPS: 3400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 11264K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15

Yes, it is the Quadro RTX 6000. In training it is much faster than the RTX 2070, but inference is very slow.

Do you have figures? Can you share console output?

Sure:
Loading model from file ./train-data/output_graph.pb
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.0-0-g3db7a99
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2019-07-26 12:41:59.666556: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-07-26 12:42:01.254405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Quadro RTX 6000 major: 7 minor: 5 memoryClockRate(GHz): 1.815
pciBusID: 0000:18:00.0
totalMemory: 23.72GiB freeMemory: 23.57GiB
2019-07-26 12:42:01.254450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-07-26 12:42:02.337865: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-26 12:42:02.337910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-07-26 12:42:02.337920: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-07-26 12:42:02.338492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 24077 MB memory) -> physical GPU (device: 0, name: Quadro RTX 6000, pci bus id: 0000:18:00.0, compute capability: 7.5)
2019-07-26 12:42:02.546578: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-07-26 12:42:02.546632: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-07-26 12:42:02.546645: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-07-26 12:42:02.547491: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Loaded model in 2.89s.
Loading language model from files lm.binary trie
Loaded language model in 0.00803s.
Running inference.
2019-07-26 12:42:05.967733: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
que
Inference took 4.981s for 4.320s audio file.

Strange. Do you know the hardware layout of the PCIe? How is the GPU connected to the CPU?
Someone in another thread reported a huge performance loss because of a funny PCIe configuration.

I have the motherboard layout with the respective PCIe connectors. The problem is that I don’t know which PCIe port the GPU is on.
mtb.pdf (255.5 KB)

I am not very familiar with this type of motherboard, but the following stands out to me: JPCIE2 (CPU1 SLOT2 PCI-E 3.0 x16) | JPCIE6 (CPU2 SLOT6 PCI-E 3.0 x16)

Motherboard specifications:
https://www.supermicro.org.cn/en/products/motherboard/X11DPG-QT

Thanks! The TensorFlow log mentions a PCI slot (pciBusID: 0000:18:00.0), so you should be able to find out a bit more.

Have you seen on the PDF page with the motherboard layout that each PCIe slot is linked to a specific CPU? Is it possible your code is running on one CPU while the GPU is attached to the other?
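For what it’s worth, on Linux you can usually check which NUMA node (and thus which socket) the GPU hangs off, using the pciBusID from your TensorFlow log (0000:18:00.0). A rough sketch, assuming a standard sysfs layout and a recent NVIDIA driver:

# NUMA node the GPU's PCIe slot is attached to (-1 means no NUMA information)
cat /sys/bus/pci/devices/0000:18:00.0/numa_node

# nvidia-smi can also print the topology matrix, including CPU affinity per GPU
nvidia-smi topo -m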

FTR, this is the thread I was thinking about: Version 0.5 performance issues with GPUs at lower link speed
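If it turns out to be the same kind of issue, the negotiated PCIe link width/speed is visible with lspci (again using the bus ID from the log); LnkCap is what the slot supports, LnkSta is what was actually negotiated:

# needs root for the full capability dump; 18:00.0 comes from the pciBusID in the log
sudo lspci -s 18:00.0 -vv | grep -i "lnkcap\|lnksta"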

@ARCS Also, could you check GPU usage with nvidia-smi dmon?
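Something like this, for example, with the inference running in a second terminal (the alphabet and audio paths below are just placeholders for your own files):

# terminal 1: sample GPU power, temperature, utilization and clocks once per second
nvidia-smi dmon -d 1

# terminal 2: run the usual inference
deepspeech --model ./train-data/output_graph.pb --alphabet alphabet.txt \
           --lm lm.binary --trie trie --audio test.wav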

That could be the case; I will investigate it further.

When I run a test:

  gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
  Idx     W     C     C     %     %     %     %   MHz   MHz
    0    14    30     -     0     0     0     0   405   300
    0    14    30     -     0     0     0     0   405   300
    0    14    30     -     0     0     0     0   405   300
    0    14    30     -     0     0     0     0   405   300
    0    24    31     -     0     0     0     0  6500  1620
    0    43    31     -     2     0     0     0  6500  1620
    0    43    31     -     0     0     0     0  6500  1620
    0    43    31     -     0     0     0     0  6500  1620
    0    43    31     -     0     0     0     0  6500  1620
    0    43    31     -     0     0     0     0  6500  1620
    0    58    32     -     7     3     0     0  7000  1950
    0    57    32     -    32     5     0     0  7000  1950
    0    39    31     -     0     0     0     0  5000   525
    0    20    31     -     0     0     0     0   810   375
    0    18    30     -     0     0     0     0   810   300
    0    14    30     -     0     0     0     0   405   300
    0    14    30     -     0     0     0     0   405   300
    0    14    30     -     0     0     0     0   405   300

and the output: Inference took 4.368s for 1.410s audio file.

Thanks for the info.
The versions I work with are the following:

TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.0-0-g3db7a99

Weird, it shows the GPU goes to full power. Really weird. Do you think you could try rebuilding libdeepspeech.so? I’m curious what happens depending on the CUDA compute capabilities you select at build time.

We default to only 3.5 these days, while your GPU is 7.5. Is it possible that is more than just suboptimal?
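Rough sketch of what I mean, assuming you build the native client inside a TensorFlow checkout the usual way (the native_client README has the exact, authoritative invocation):

# tell TensorFlow's configure step to target compute capability 7.5 (Turing)
TF_CUDA_COMPUTE_CAPABILITIES=7.5 ./configure

# then rebuild the inference library with CUDA enabled
bazel build --config=cuda -c opt //native_client:libdeepspeech.so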

I tried using an mmap-able model (.pbmm) to read the data directly from disk, and the inference speed improved. The inference took 2.854 seconds for a 1.410 second audio file (with the GPU).
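For reference, in case others want to try the same thing, the conversion to a .pbmm can be done with TensorFlow’s convert_graphdef_memmapped_format tool, roughly like this (paths are just examples):

# build the conversion tool inside the TensorFlow checkout (TF 1.13 still ships contrib)
bazel build tensorflow/contrib/util:convert_graphdef_memmapped_format

# convert the frozen graph into an mmap-able model
bazel-bin/tensorflow/contrib/util/convert_graphdef_memmapped_format \
  --in_graph=./train-data/output_graph.pb \
  --out_graph=./train-data/output_graph.pbmm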

Could it be that the motherboard architecture is a factor? From what I understand, on a dual-socket motherboard each CPU has its own dedicated RAM slots (RAM A with CPU A); maybe the data sits in RAM B while being processed on CPU A?

It could, but still, 2.8s for a 1.4s audio file is way too much for that GPU :confused:
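If you want to quickly rule out the NUMA side, you could also try pinning the process to the node the GPU reports (numactl needs to be installed; node 0 below is just an assumption, use whatever the sysfs check above returns):

# run inference with CPU and memory allocations bound to NUMA node 0
numactl --cpunodebind=0 --membind=0 \
  deepspeech --model ./train-data/output_graph.pbmm --alphabet alphabet.txt \
             --lm lm.binary --trie trie --audio test.wav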