I have a dual-socket server, and I have noticed that inference takes a long time (I use the GPU for both training and transcription). Could the dual-socket CPU layout be affecting this? I haven't found much information about it.
I have a Quadro RTX 6000. Inference takes about 1.1 seconds per second of audio. I checked the GPU load during inference, and yes, the GPU is in use.
I have also tried an RTX 2070 in a single-socket machine, and inference times there are very low.
Loading model from file ./train-data/output_graph.pb
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.0-0-g3db7a99
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2019-07-26 12:41:59.666556: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-07-26 12:42:01.254405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Quadro RTX 6000 major: 7 minor: 5 memoryClockRate(GHz): 1.815
pciBusID: 0000:18:00.0
totalMemory: 23.72GiB freeMemory: 23.57GiB
2019-07-26 12:42:01.254450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-07-26 12:42:02.337865: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-26 12:42:02.337910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-07-26 12:42:02.337920: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-07-26 12:42:02.338492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 24077 MB memory) -> physical GPU (device: 0, name: Quadro RTX 6000, pci bus id: 0000:18:00.0, compute capability: 7.5)
2019-07-26 12:42:02.546578: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-07-26 12:42:02.546632: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-07-26 12:42:02.546645: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-07-26 12:42:02.547491: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Loaded model in 2.89s.
Loading language model from files lm.binary trie
Loaded language model in 0.00803s.
Running inference.
2019-07-26 12:42:05.967733: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
Inference took 4.981s for 4.320s audio file.
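For reference, the last log line can be turned into a real-time factor (inference time divided by audio duration), which is where the "about 1.1 seconds per second of audio" figure comes from. This is only a small arithmetic sketch; the helper name `real_time_factor` is made up for illustration and is not part of DeepSpeech.

```python
def real_time_factor(inference_s: float, audio_s: float) -> float:
    """Seconds of compute spent per second of audio (lower is better)."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return inference_s / audio_s


# Numbers taken from the log line "Inference took 4.981s for 4.320s audio file."
rtf = real_time_factor(4.981, 4.320)
print(f"RTF = {rtf:.3f}")  # a value above 1.0 means slower than real time
```

A value above 1.0 means the run is slower than real time.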
((slow to reply) [NOT PROVIDING SUPPORT])
Strange. Do you know the hardware layout of the PCIe slots? How is the GPU connected to the CPU?
Someone in another thread reported a huge performance loss because of an unusual PCIe configuration.
I have the motherboard layout with its PCIe connectors. The problem is that I don't know which PCIe slot the GPU is in. mtb.pdf (255,5 KB)
I am not very familiar with this type of motherboard, but this caught my attention: JPCIE2 (CPU1 SLOT2 PCI-E 3.0 x16) | JPCIE6 (CPU2 SLOT6 PCI-E 3.0 x16)
((slow to reply) [NOT PROVIDING SUPPORT])
Thanks! The TensorFlow log mentions a PCI bus ID, so you should be able to find out a bit more.
Have you seen, on the PDF page with the motherboard layout, that each PCIe slot is linked to a CPU? Is it possible that your code is running on one CPU while the GPU is attached to the other?
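One way to check which CPU socket the GPU hangs off, assuming a Linux host: the kernel exposes a `numa_node` file for each PCI device under sysfs, and the bus ID printed in the TensorFlow log (`0000:18:00.0`) identifies the device. The sketch below is only a rough illustration; the function name `gpu_numa_node` and the `sysfs_root` parameter are made up for this example.

```python
from pathlib import Path
from typing import Optional


def gpu_numa_node(pci_bus_id: str, sysfs_root: str = "/sys") -> Optional[int]:
    """Return the NUMA node a PCI device is attached to, or None if unknown.

    Reads <sysfs_root>/bus/pci/devices/<pci_bus_id>/numa_node (Linux only).
    The kernel reports -1 when it has no NUMA information for the device.
    """
    # sysfs uses lowercase hexadecimal in PCI addresses
    path = Path(sysfs_root) / "bus" / "pci" / "devices" / pci_bus_id.lower() / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        return None
    return node if node >= 0 else None


# Bus ID copied from the TensorFlow log above; prints None if the device is absent.
print(gpu_numa_node("0000:18:00.0"))
```

On a dual-socket board, node 0 and node 1 usually correspond to the two sockets, but the exact mapping depends on the platform and BIOS settings.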
((slow to reply) [NOT PROVIDING SUPPORT])
Weird, it shows the GPU going to full power. Really weird. Do you think you could try rebuilding libdeepspeech.so? I'm curious what happens depending on the CUDA compute capabilities you select at build time.
We default to only 3.5 these days, and your GPU is 7.5. Is it possible that this is more than suboptimal?
I tried using an mmapped model (.pbmm) so that the data is read directly from disk, and the inference speed increased. Inference took 2.854 s for a 1.410 s audio file (with GPU).
Could the motherboard architecture be affecting this? From what I understand, on a dual-socket motherboard each CPU has its own dedicated RAM slots (RAM A with CPU A). Maybe the data is in RAM B but processed on CPU A?
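A possible way to test the "RAM B, CPU A" idea, assuming a Linux host: restrict the process to the cores of a single NUMA node before running inference, so that memory it touches tends to be allocated on that same node under the kernel's default local-allocation policy. The sketch below is only an illustration; `parse_cpulist` and `pin_to_node` are made-up helper names, and pinning CPUs this way does not strictly bind memory (a tool such as `numactl` with `--cpunodebind` and `--membind` does that from outside the process).

```python
import os
from pathlib import Path
from typing import Set


def parse_cpulist(text: str) -> Set[int]:
    """Parse a kernel cpulist string such as "0-3,8-11" into a set of CPU ids."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        part = part.strip()
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            start, end = part.split("-", 1)
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_node(node: int, sysfs_root: str = "/sys") -> Set[int]:
    """Restrict the current process to the CPUs of one NUMA node (Linux only)."""
    cpulist = Path(sysfs_root) / "devices" / "system" / "node" / f"node{node}" / "cpulist"
    cpus = parse_cpulist(cpulist.read_text())
    if not cpus:
        raise RuntimeError(f"NUMA node {node} has no CPUs")
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus


# Example (not executed here): call pin_to_node(0) before loading the model,
# where 0 is assumed to be the node the GPU is attached to.
```

If inference time changes noticeably between pinning to node 0 and node 1, that would point to a cross-socket effect; if not, the cause is probably elsewhere.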
((slow to reply) [NOT PROVIDING SUPPORT])
It could, but still, 2.8 s for 1.4 s of audio is way too much for that GPU.