DeepSpeech TensorFlow does not use all CPU cores

I could see that the CPU-only versions/releases of DeepSpeech for Raspberry Pi 3 or ARM64 utilize only one CPU core for performing the inference, while parallel execution across cores would improve the inference time. Has anyone seen this behavior, and is there a way to configure TensorFlow to utilize all the CPU cores for inference?

All the verification I could do on this confirmed that TensorFlow was properly using all cores. Can you provide more feedback on what makes you think it is not the case?

In particular, running with the environment variable TF_CPP_MIN_VLOG_LEVEL=2 should give an indication of the inter- and intra-op parallelism being used.

@sranjeet.visteon

$ TF_CPP_MIN_VLOG_LEVEL=2 ./deepspeech --model output_graph.pbmm --alphabet alphabet.txt --audio LDC93S1.wav 2>&1 | grep -i parallelism
2018-10-23 16:41:24.924421: I tensorflow/core/common_runtime/local_device.cc:41] Local device intra op parallelism threads: 4
2018-10-23 16:41:24.925181: I tensorflow/core/common_runtime/process_util.cc:82] Direct session inter op parallelism threads: 4 

A parallel htop also shows multiple threads running and taking CPU.

@lissyx Below is the output from my Jetson TX2 HW

(venv_0.3.0) nvidia@tegra-ubuntu:~/deepspeech/native_client.arm64.cpu.linux$ TF_CPP_MIN_VLOG_LEVEL=2 ./deepspeech --model ./…/models/output_graph.pbmm --alphabet ./…/models/alphabet.txt --audio ./…/wav 2>&1 | grep -i parallelism
2018-10-23 16:52:26.319769: I tensorflow/core/common_runtime/local_device.cc:41] Local device intra op parallelism threads: 6
2018-10-23 16:52:26.320450: I tensorflow/core/common_runtime/process_util.cc:82] Direct session inter op parallelism threads: 6

It shows that 6 threads are available, and running htop in parallel I could see all the CPUs being used, but only one CPU is heavily utilized at > 70% while the rest of the CPUs are mostly < 20% used. Also, the average usage across all CPUs is ~25%. Is this expected behavior?

This is more of a TensorFlow-level question, but it does confirm that there is parallelism being triggered. Honestly, I think it’s mostly the same level of usage we can see on other hardware. However, if you are running on a Jetson, you should rather look into cross-compiling for your system with CUDA and leveraging the GPU.

@lissyx, thanks. Yes, this is more of a TensorFlow question, but I wanted to hear from other DeepSpeech users about my observation. On an RPi3 I could find a similar concern, where utilization is always <= 100% across cores while 400% is available to be utilized.

I have a use case to run DeepSpeech on other hardware where only the CPU is available, and that is the reason I am trying it without enabling the GPU on the Jetson TX2.

But all of that actually depends on how much parallelism there is in the model. Just because there are 4 CPUs does not mean we can make use of all of them.
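To make that concrete, here is a minimal sketch of how the inter-/intra-op thread counts printed in the logs above are exposed through the TensorFlow C++ session options. This is not the actual DeepSpeech native client code; the thread values are just placeholders:

// session_threads_sketch.cc -- hypothetical example, not DeepSpeech code
#include <memory>
#include "tensorflow/core/public/session.h"
#include "tensorflow/core/protobuf/config.pb.h"

int main() {
  tensorflow::SessionOptions options;
  // These correspond to the "intra op" / "inter op" parallelism threads
  // shown in the logs; 0 means "let TensorFlow decide" (usually all cores).
  options.config.set_intra_op_parallelism_threads(4);
  options.config.set_inter_op_parallelism_threads(4);

  std::unique_ptr<tensorflow::Session> session(tensorflow::NewSession(options));
  // Even with these set, ops only run in parallel when the graph has
  // independent ops (inter-op) or kernels that parallelize internally (intra-op).
  return 0;
}

So a recurrent model with a long dependency chain can leave most cores idle no matter how many threads the pools expose.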

Well, currently, performance on the RPi3 and that kind of board is not as good as we’d like; those builds are mostly there for demo purposes.

Running a (slightly modified) version of the model under the tflite benchmark_model tool yields better performance on a Pixel 2 device (~2.40B FLOPs per sec) versus RPi3 or LePotato boards (~490M FLOPs per sec).

plain-vs-tflite.zip (249.7 KB)
This contains SVG graphs of the model, the plain version and the tflite one, the latter being modified so that it can be ingested by toco and run on the TFLite engine.

@lissyx, is there a plan to release a model and implementation of DeepSpeech based on tflite any time soon? As I mentioned before, we have a use case to run on a CPU-only system, and the tflite version might be a better option to evaluate.

Yes, there’s an issue opened on GitHub. There’s an (outdated) WIP tflite branch; I should send a PR on it with my fixes, and it does run in the end, at least a 2048-wide model trained for one epoch on LDC93S1. We still have work to do, but it’s starting to work. Except when you want NNAPI on Android, but that’s another story :slight_smile:

Apologies for the necro once more, but I am wondering the same. Also, has the variable TF_CPP_MIN_VLOG_LEVEL=2 changed? It seems to give no more output.

Probably a tflite question, but on a Pi3 only a single core seems to be maxed out, with very little else.
deepspeech --model deepspeech-0.7.0-models.tflite --scorer deepspeech-0.7.0-models.scorer --audio audio/2830-3980-0043.wav
Which is a shame as

Loading model from file deepspeech-0.7.0-models.tflite
TensorFlow: v1.15.0-24-gceb46aa
DeepSpeech: v0.7.0-0-g3fbbca2
Loaded model in 0.00338s.
Loading scorer from files deepspeech-0.7.0-models.scorer
Loaded scorer in 0.000667s.
Running inference.
experience proves this
Inference took 4.327s for 1.975s audio file.

I don’t know much about tflite, but it seems to be able to do the inference in just over 200% of realtime, which is fantastic, but it makes you wonder: if the other 3 cores kicked in more, could it not be even much faster?

At a glance it does seem as if void SetNumThreads(int num_threads); is set to 1, or for some reason it is just selecting 1 when given -1.
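For reference, a rough sketch of how that call is usually wired up when driving the TFLite C++ interpreter directly; the model path and thread count are placeholders, and this is not the libdeepspeech implementation:

// tflite_threads_sketch.cc -- hypothetical example, not libdeepspeech code
#include <memory>
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
  auto model = tflite::FlatBufferModel::BuildFromFile("deepspeech-0.7.0-models.tflite");
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);

  // -1 lets the runtime decide; a positive value fixes the kernel thread pool size.
  interpreter->SetNumThreads(4);

  interpreter->AllocateTensors();
  // ... fill input tensors here, then:
  interpreter->Invoke();
  return 0;
}

Whether setting a higher value actually helps still depends on the ops in the graph supporting multi-threaded kernels.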

That’s what I get normally, and I noticed the same with the Pi4,
where a core is maxed but the others are hardly touched.

On the Pi we may have to install this version to get threads going.

If you know what you’re about to do requires an apology, and the correct thing to do is just as easy (create a new thread), why did you still do it?


As documented, TFLite runtime does not leverage threads due to TensorFlow-level limitations with our model.

No, you may not.

Because it contained the thread and the details I was referring to; the apology was only because I couldn’t find any newer reference.

So is it only TFLite that runs single-threaded, or do all the runtimes behave the same? I presume the GPU build does too because of GPU memory allocation, but there it doesn’t matter thanks to the GPU acceleration.
I had a look at the Mozilla repo vs TensorFlow and yeah, there are 240 commits on top of tensorflow, so that wasn’t such a great idea.
A shame though, as in the examples from the repo posted an approx. 2.5x perf gain can be made, and if that were possible even a Pi3 might manage faster than realtime.
The Pi3 seems to be approx. just under 0.5x realtime.
The Pi4 speed is cool though, and I would probably run a KWS satellite with the Pi4 as a ‘server’, or maybe something ‘more’.
I presume a single-threaded model is not the intention; is that part of any roadmap yet? Or is it that a GPU/accelerated model is the main direction, and it just doesn’t translate that well to tflite?

Did you read my reply? It looks like not.

I did, but what you are doing with KWS/VAD doesn’t make sense for many devices, hence why I asked if it’s on a roadmap.

If it’s going to remain single-threaded with the performance you have, then isn’t KWS/VAD better offloaded to satellites, with DeepSpeech used in a server mode?
I read your reply, and the KWS/VAD part seems a bit of a curve ball, but with luck you’ll have a better time of things.

Those are two completely unrelated problems. I’m working on this threading issue, because TensorFlow now has solutions, but it is in such a state that we can’t integrate it yet, and it requires a huge amount of debugging.

Sorry, this is a complicated topic, it takes time, and I don’t have a lot available.
