Inference time on V100 seems slow

Hi,

After much wrangling, I was able to get DeepSpeech working on the new Amazon V100 NVIDIA instances (p3.2xlarge). The inference seems quite slow though: I'm getting 0.95x to 1.25x real-time on the V100, i.e. about 2 seconds of inference for a 2 second audio clip. This is a top card, and that seems much slower than what others are reporting (closer to 0.3x to 0.4x).

For comparison, the CPU takes 2x to 2.5x real-time for the same inference. I'm really surprised the V100 isn't performing better, and I'm wondering if I'm doing something suboptimal.

Getting the CPU inference going was fine, but the GPU inference was very frustrating because of version mismatches between TensorFlow, DeepSpeech, CUDA, and cuDNN. In the end, the only config I could get running was:
pip install 'tensorflow-gpu==1.5.0'
pip install deepspeech-gpu (the PyPI package doesn't work, so I used the artifact here: https://tools.taskcluster.net/index/project.deepspeech.deepspeech.native_client.master/gpu)
manually install CUDA 9.0
manually install cuDNN 7.0.5
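
For anyone hitting the same mismatches, a rough sanity check of the stack (assuming the CUDA 9.0 / cuDNN 7.0.5 combination above; library names will differ for other versions):

nvidia-smi                              # driver sees the V100
ldconfig -p | grep libcudart.so.9.0     # CUDA 9.0 runtime is on the loader path
ldconfig -p | grep libcudnn.so.7        # cuDNN 7 is on the loader path

Then run an inference and watch nvidia-smi in another terminal; the deepspeech/python process should show up with GPU memory allocated if the GPU build is actually being used.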

Is anyone getting faster performance on the Amazon V100? Or is 0.95x the best I can hope for?

thx

Ok, update: I converted the model to mmap format using the native client tool located at https://tools.taskcluster.net/index/project.deepspeech.deepspeech.native_client.master/gpu
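
For reference, the conversion is done with TensorFlow's convert_graphdef_memmapped_format tool that ships in that package; assuming the standard flags and the default model paths, the invocation looks something like:

./convert_graphdef_memmapped_format --in_graph=../models/output_graph.pb --out_graph=../models/output_graph.pbmm

The resulting .pbmm file is then passed to deepspeech in place of the .pb.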

The initial inference is about the same, but subsequent inferences are now much better, at about 0.23x real-time on the V100. Does this seem right to folks?

0.23x seems much better, but I'm surprised the first inference is still so slow. Maybe some setup work, like loading the model onto the GPU?

I remember there was a specific patch to apply for those Volta GPUs as well. We do not build with them, so I don't know whether it has an impact at runtime, or whether it is enough to just install all the patches when you set up the inference tooling.

Patches are at https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1604&target_type=runfilelocal (for Ubuntu 16.04 / amd64), and the first one is explicitly meant to improve performance on Volta GPUs.


I’ve rented the same instance, and I’m running some experiments. This is on the Ubuntu 16.04 image provided by Amazon, with the nvidia-384 driver installed, and then CUDA 9.0 + cuDNN v7 installed by hand.

The first run took 30-45 secs before the (expected) failure. I don't really know why. Subsequent runs seem nicer:

ubuntu@ip-172-31-32-91:~/ds/gpu$ time ./deepspeech ../models/output_graph.pb ../models/alphabet.txt ../audio/ -t 
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-03-07 14:20:41.678897: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-03-07 14:20:41.786354: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-07 14:20:41.786738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.77GiB freeMemory: 15.35GiB
2018-03-07 14:20:41.786766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
Running on directory ../audio/
> ../audio//8455-210777-0068.wav
your powr is sufficient i said
cpu_time_overall=2.45969 cpu_time_mfcc=0.00468 cpu_time_infer=2.45501
> ../audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=0.37646 cpu_time_mfcc=0.00452 cpu_time_infer=0.37194
> ../audio//2830-3980-0043.wav
experience proves tis
cpu_time_overall=0.28520 cpu_time_mfcc=0.00326 cpu_time_infer=0.28194

real	0m4.124s
user	0m2.704s
sys	0m1.524s

Runs with mmap() are even nicer:

ubuntu@ip-172-31-32-91:~/ds/gpu$ time ./deepspeech ../models/output_graph.pbmm ../models/alphabet.txt ../audio/ -t 
2018-03-07 14:21:25.242678: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-03-07 14:21:25.342985: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-07 14:21:25.343375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.77GiB freeMemory: 15.35GiB
2018-03-07 14:21:25.343403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
Running on directory ../audio/
> ../audio//8455-210777-0068.wav
your powr is sufficient i said
cpu_time_overall=0.71324 cpu_time_mfcc=0.00432 cpu_time_infer=0.70892
> ../audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=0.45870 cpu_time_mfcc=0.00454 cpu_time_infer=0.45416
> ../audio//2830-3980-0043.wav
experience proves tis
cpu_time_overall=0.33660 cpu_time_mfcc=0.00332 cpu_time_infer=0.33327

real	0m2.201s
user	0m1.588s
sys	0m0.732s

Speaking in terms of real-time factor, after a few runs I get these stable (low variation) values:

file                            audio length (s)   cpu_time_infer (s)   RT factor
../audio/8455-210777-0068.wav   1.975               0.70892              0.359
../audio/4507-16021-0012.wav    2.735               0.45416              0.166
../audio/2830-3980-0043.wav     2.590               0.33327              0.129
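
Real-time factor here is just cpu_time_infer divided by the clip duration; assuming sox is installed (for soxi), it can be reproduced with something like:

soxi -D ../audio/8455-210777-0068.wav    # prints the clip duration in seconds (1.975 here)
echo "0.70892 / 1.975" | bc -l           # -> 0.3589..., i.e. ~0.36x real-time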

This is with CUDA 9.0 / CuDNN v7.

As an aside, you don't have to scratch your head: there's no need for tensorflow-gpu if you don't train. What was problematic with CUDA and cuDNN? We assume that people wanting to run the GPU-enabled version have already properly configured their system.

After applying patches 1 and 2 from https://developer.nvidia.com/cuda-90-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1604&target_type=runfilelocal I’m getting mostly the same results.
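
For completeness, applying them is the usual runfile procedure; the filenames below are placeholders, use whatever patch runfiles the archive page actually serves:

sudo sh cuda_9.0.176.1_linux.run   # patch 1 (hypothetical filename)
sudo sh cuda_9.0.176.2_linux.run   # patch 2 (hypothetical filename)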

@mark My guess for now is that our model does not benefit from the extra power of the V100 on the small amount of data we ingest. I don't see any reference to the audio files you used for your tests; what's their length? Testing with a 5.55 s audio file I'm getting ~0.16x using .pbmm and ~0.155x using .pb, when doing multiple inferences.

Doing just a one-shot inference on the same file gets me ~0.23x with .pbmm, so close to your results.

These Volta patches were for 9.1 I believe, not 9.0. The tensorflow package wouldn't work with 9.1, but we could probably build TF against it to see if that patch improves perf.

Good that you’re getting same ballpark numbers now.

FWIW, I've always noticed that the first TensorFlow run on an AWS P3 instance takes about a minute to get going. This has been true for all my TF projects, not just DeepSpeech. Not sure why, but my guess is it's related to how the instance connects to the actual underlying hardware.

Good to know - it was the 3 audio files included in the 0.1.1 release, so it looks like basically the same result. There might be a slight speed increase in your case since you're using the option to point at a directory instead of an individual file; my version of deepspeech didn't seem to support that.

Ok, total noob question, but why is tensorflow-gpu not needed for GPU inference (not just training)? When I tried just the plain tensorflow package I seemed to be getting 2x real-time on the CPU.


Because tensorflow-gpu has nothing to do with deepspeech-gpu :slight_smile:

Do you mean that deepspeech-gpu is completely statically linked and doesn’t require any shared libs from TF?


Yes, that’s it, everything is there.
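
One way to check for yourself, assuming the Linux native_client layout where the deepspeech binary ships next to libdeepspeech.so, is to look at the dynamic dependencies:

ldd ./deepspeech                              # lists libdeepspeech.so, CUDA/cuDNN libs, libc, etc.
ldd ./libdeepspeech.so | grep -i tensorflow   # finds nothing: TensorFlow is built into the library

No external tensorflow or tensorflow-gpu install is needed at runtime.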