GPU much slower

I must be missing something obvious, or making some fundamental mistake in understanding, but I expected the GPU version to run inference far faster than the CPU version. Instead, this run took roughly three times as long as the CPU version did for the same inference. Perhaps it’s not really engaging the GPU? Is there a way to verify?

[mail_reknew@deepdictation-1-gpu ~]$ deepspeech models/output_graph.pb data/recording2.wav models/alphabet.txt models/lm.binary models/trie
Loading model from file models/output_graph.pb
2018-02-22 23:01:22.906902: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-02-22 23:01:23.602419: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-02-22 23:01:23.602852: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:04.0
totalMemory: 15.90GiB freeMemory: 15.61GiB
2018-02-22 23:01:23.602891: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)
Loaded model in 4.769s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 11.281s.
Running inference.
my mom this is bread i am speaking as clearly as possible and of slowly as possible i hope you get this
Inference took 17.003s for 10.000s audio file.

To partially answer my own question: I guess this is just a case where parallelization for a small task doesn’t make sense? The overhead of copying data into GPU memory, etc., is overwhelming any benefit. Right?

Is there a way to split out the timings to see the actual benefit of the GPU for a small test like this?
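
For instance, would something like this be the right way to break it down? (Just a sketch on my side, using the Python bindings; the import path and the Model/stt signatures below are copied from the 0.1-era client.py, so they may need adjusting for whatever version is actually installed, and the paths are the ones from my run above.)

import time
import scipy.io.wavfile as wav
from deepspeech.model import Model

# Defaults taken from the DeepSpeech client.py of that era
BEAM_WIDTH = 500
LM_WEIGHT = 1.75
WORD_COUNT_WEIGHT = 1.00
VALID_WORD_COUNT_WEIGHT = 1.00
N_FEATURES = 26
N_CONTEXT = 9

t0 = time.time()
ds = Model('models/output_graph.pb', N_FEATURES, N_CONTEXT,
           'models/alphabet.txt', BEAM_WIDTH)
t1 = time.time()
ds.enableDecoderWithLM('models/alphabet.txt', 'models/lm.binary', 'models/trie',
                       LM_WEIGHT, WORD_COUNT_WEIGHT, VALID_WORD_COUNT_WEIGHT)
t2 = time.time()

fs, audio = wav.read('data/recording2.wav')

# The first stt() call includes any one-time GPU/graph setup;
# the second should reflect steady-state inference time.
t3 = time.time()
print(ds.stt(audio, fs))
t4 = time.time()
print(ds.stt(audio, fs))
t5 = time.time()

print('model load: %.3fs' % (t1 - t0))
print('LM load:    %.3fs' % (t2 - t1))
print('first stt:  %.3fs' % (t4 - t3))
print('second stt: %.3fs' % (t5 - t4))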

Use the environment variable TF_CPP_MIN_LOG_LEVEL=1 or above; this will give you detailed information about the computations, so you’ll be able to tell what is running on the GPU and what is not.

How many GPUs are there? This shows only one.

This happens to me when deepspeech with GPU is run on one sample only. When it runs as part of a node.js server, or via a modified Python script for batch processing, the first call takes a long time, but all subsequent inferences take about 30-40% of the time of the CPU version.

E.g., after the first-request warm-up, a 5-second audio file takes about 5 seconds on CPU but only about 2 seconds on GPU.
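
In rough terms, the batch script just keeps one Model instance alive and loops over the files, something like this (a sketch only; I’m assuming the 0.1-era deepspeech Python package API here, and the paths are placeholders):

import glob
import time
import scipy.io.wavfile as wav
from deepspeech.model import Model

# Defaults from the DeepSpeech client.py
BEAM_WIDTH = 500
LM_WEIGHT = 1.75
WORD_COUNT_WEIGHT = 1.00
VALID_WORD_COUNT_WEIGHT = 1.00
N_FEATURES = 26
N_CONTEXT = 9

# Load the acoustic model and language model once, up front
ds = Model('models/output_graph.pb', N_FEATURES, N_CONTEXT,
           'models/alphabet.txt', BEAM_WIDTH)
ds.enableDecoderWithLM('models/alphabet.txt', 'models/lm.binary', 'models/trie',
                       LM_WEIGHT, WORD_COUNT_WEIGHT, VALID_WORD_COUNT_WEIGHT)

# Only the first file pays the warm-up cost; the rest run at steady-state speed
for path in sorted(glob.glob('data/*.wav')):
    fs, audio = wav.read(path)
    start = time.time()
    text = ds.stt(audio, fs)
    print('%s (%.3fs): %s' % (path, time.time() - start, text))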

Yes, that’s the other alternative. One easy way to verify this is to check with the latest DeepSpeech artifacts from TaskCluster and use the mmap()'d version of the model: it needs to be converted from the default output_graph.pb using https://index.taskcluster.net/v1/task/project.deepspeech.tensorflow.pip.r1.5.cpu/artifacts/public/convert_graphdef_memmapped_format

This also lowers the heap memory requirements at runtime quite a lot.

But still, 17 secs for 10 secs of audio, even if you take out 5-6 secs for model loading/parsing, on a P100 I would have expected much faster inference.

I guess I am not sure how to read the results. You say there is only one GPU. There is one Tesla P100 card with presumably 3,584 cores. Is that what you mean by one GPU?

You wouldn’t happen to be able to post a sample of that batch processing Python script, would you?

I don’t see any debugging or additional timing info after exporting that variable. Should I be looking in a log file, or does it come to STDOUT?

Here’s what I did:

$ export TF_CPP_MIN_LOG_LEVEL=1
$ echo $TF_CPP_MIN_LOG_LEVEL 
1
$ deepspeech models/output_graph.pb data/recording2.wav models/alphabet.txt models/lm.binary models/trie
Loading model from file models/output_graph.pb
...

Is there some trick to creating a usable converted graph? I did:

$ ./convert_graphdef_memmapped_format --in_graph="models/output_graph.pb" --out_graph="models/output_graph.mmmapped_graph"
2018-02-24 20:42:34.555798: I tensorflow/contrib/util/convert_graphdef_memmapped_format_lib.cc:171] Converted 10 nodes
[mail_reknew@deepdictation-1-gpu ~]$ deepspeech models/output_graph.mmapped_graph data/recording2.wav models/alphabet.txt models/lm.binary models/trie
Loading model from file models/output_graph.mmapped_graph
Data loss: Can't parse models/output_graph.mmapped_graph as binary proto
Loaded model in 0.467s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 2.421s.
Running inference.
Segmentation fault

Make sure you are using binaries from TaskCluster, master branch: https://tools.taskcluster.net/index/project.deepspeech.deepspeech.native_client.master/gpu

The env variable might also be TF_CPP_MIN_VLOG_LEVEL; try values above 2. You should get lots of output on stderr mentioning which device each op is running on :).

Also, please export the mmap-format graph to output_graph.pbmm, for consistency with our codebase. The order of arguments has also changed: the WAV file should now be the last one.

You can play with batching using the deepspeech binary from native_client.tar.xz: pass a directory instead of just one WAV file, and add -t as the last argument.