GPU much slower

I must be missing something obvious, or making some fundamental mistake in understanding, but I expected the GPU version to run inference far faster than the CPU version. Instead, this run took roughly three times as long as the CPU version did for the same inference. Perhaps it’s not really engaging the GPU? Is there a way to verify?

[mail_reknew@deepdictation-1-gpu ~]$ deepspeech models/output_graph.pb data/recording2.wav models/alphabet.txt models/lm.binary models/trie
Loading model from file models/output_graph.pb
2018-02-22 23:01:22.906902: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-02-22 23:01:23.602419: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-02-22 23:01:23.602852: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:04.0
totalMemory: 15.90GiB freeMemory: 15.61GiB
2018-02-22 23:01:23.602891: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)
Loaded model in 4.769s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 11.281s.
Running inference.
my mom this is bread i am speaking as clearly as possible and of slowly as possible i hope you get this
Inference took 17.003s for 10.000s audio file.

To partially answer my own question: I guess this is just a case where parallelization for a small task doesn’t make sense? The overhead of copying data into GPU memory, etc., is overwhelming any benefit. Right?

Is there a way to split out the timings to see the actual benefit of the GPU for a small test like this?
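
For instance, would something like this be the right way to break it down? (Just a sketch on my side, using the Python bindings; the import path and the Model/stt signatures below are copied from the 0.1-era client.py, so they may need adjusting for whatever version is actually installed, and the paths are the ones from my run above.)

import time
import scipy.io.wavfile as wav
from deepspeech.model import Model

# Defaults taken from the DeepSpeech client.py of that era
BEAM_WIDTH = 500
LM_WEIGHT = 1.75
WORD_COUNT_WEIGHT = 1.00
VALID_WORD_COUNT_WEIGHT = 1.00
N_FEATURES = 26
N_CONTEXT = 9

t0 = time.time()
ds = Model('models/output_graph.pb', N_FEATURES, N_CONTEXT,
           'models/alphabet.txt', BEAM_WIDTH)
t1 = time.time()
ds.enableDecoderWithLM('models/alphabet.txt', 'models/lm.binary', 'models/trie',
                       LM_WEIGHT, WORD_COUNT_WEIGHT, VALID_WORD_COUNT_WEIGHT)
t2 = time.time()

fs, audio = wav.read('data/recording2.wav')

# The first stt() call includes any one-time GPU/graph setup;
# the second should reflect steady-state inference time.
t3 = time.time()
print(ds.stt(audio, fs))
t4 = time.time()
print(ds.stt(audio, fs))
t5 = time.time()

print('model load: %.3fs' % (t1 - t0))
print('LM load:    %.3fs' % (t2 - t1))
print('first stt:  %.3fs' % (t4 - t3))
print('second stt: %.3fs' % (t5 - t4))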

Use the environment variable TF_CPP_MIN_LOG_LEVEL=1 or above; this will give you detailed information about the computations, so you’ll be able to tell what is running on the GPU and what is not.

How many GPUs are there? This shows only one.

This happens to me when deepspeech with GPU is run on one sample only. When it runs as part of a node.js server, or via a modified Python script for batch processing, the first call takes a long time, but all subsequent inferences take about 30-40% of the time of the CPU version.

E.g., after the first-request warm-up, a 5-second audio file takes about 5 seconds on CPU but only about 2 seconds on GPU.
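
In rough terms, the batch script just keeps one Model instance alive and loops over the files, something like this (a sketch only; I’m assuming the 0.1-era deepspeech Python package API here, and the paths are placeholders):

import glob
import time
import scipy.io.wavfile as wav
from deepspeech.model import Model

# Defaults from the DeepSpeech client.py
BEAM_WIDTH = 500
LM_WEIGHT = 1.75
WORD_COUNT_WEIGHT = 1.00
VALID_WORD_COUNT_WEIGHT = 1.00
N_FEATURES = 26
N_CONTEXT = 9

# Load the acoustic model and language model once, up front
ds = Model('models/output_graph.pb', N_FEATURES, N_CONTEXT,
           'models/alphabet.txt', BEAM_WIDTH)
ds.enableDecoderWithLM('models/alphabet.txt', 'models/lm.binary', 'models/trie',
                       LM_WEIGHT, WORD_COUNT_WEIGHT, VALID_WORD_COUNT_WEIGHT)

# Only the first file pays the warm-up cost; the rest run at steady-state speed
for path in sorted(glob.glob('data/*.wav')):
    fs, audio = wav.read(path)
    start = time.time()
    text = ds.stt(audio, fs)
    print('%s (%.3fs): %s' % (path, time.time() - start, text))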

Yes, that’s the other alternative. One easy way to verify this is to check with the latest DeepSpeech artifacts from TaskCluster and use the mmap()'d version of the model: it needs to be converted from the default output_graph.pb using https://index.taskcluster.net/v1/task/project.deepspeech.tensorflow.pip.r1.5.cpu/artifacts/public/convert_graphdef_memmapped_format

This also lowers the heap memory requirements at runtime quite a lot.

But still, 17 secs for 10 secs of audio, even if you take out 5-6 secs for model loading/parsing, on a P100 I would have expected much faster inference.

I guess I am not sure how to read the results. You say there is only one GPU. There is one Tesla P100 card with presumably 3,584 cores. Is that what you mean by one GPU?

You wouldn’t happen to be able to post a sample of that batch processing Python script, would you?

I don’t see any debugging or additional timing info after exporting that variable. Should I be looking in a log file, or does it come to STDOUT?

Here’s what I did:

$ export TF_CPP_MIN_LOG_LEVEL=1
$ echo $TF_CPP_MIN_LOG_LEVEL 
1
$ deepspeech models/output_graph.pb data/recording2.wav models/alphabet.txt models/lm.binary models/trie
Loading model from file models/output_graph.pb
...

Is there some trick to creating a usable converted graph? I did:

$ ./convert_graphdef_memmapped_format --in_graph="models/output_graph.pb" --out_graph="models/output_graph.mmmapped_graph"
2018-02-24 20:42:34.555798: I tensorflow/contrib/util/convert_graphdef_memmapped_format_lib.cc:171] Converted 10 nodes
[mail_reknew@deepdictation-1-gpu ~]$ deepspeech models/output_graph.mmapped_graph data/recording2.wav models/alphabet.txt models/lm.binary models/trie
Loading model from file models/output_graph.mmapped_graph
Data loss: Can't parse models/output_graph.mmapped_graph as binary proto
Loaded model in 0.467s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 2.421s.
Running inference.
Segmentation fault

Make sure you are using binaries from TaskCluster, master branch: https://tools.taskcluster.net/index/project.deepspeech.deepspeech.native_client.master/gpu

The env variable might also be TF_CPP_MIN_VLOG_LEVEL; try values above 2. You should get lots of output on stderr mentioning which device each op is running on :).

Also, please export the mmap-format graph to output_graph.pbmm, for consistency with our codebase. The order of arguments has also changed: the WAV file should now be the last one.

You can play with batching using the deepspeech binary from native_client.tar.xz: pass a directory instead of just one WAV file, and add -t as the last argument.