How can I use intel-tensorflow for DeepSpeech inference?

Hi everybody,

I’m implementing a sort of “Google Assistant”, but working offline, for the purpose of automating tasks on my computer. DeepSpeech is there to recognize simple commands like: “open terminal”, “file an issue in Jira”, etc. The problem is that CPU inference is quite slow (~4 s on my CPU), so I would like to use the Intel flavor of TensorFlow (https://pypi.org/project/intel-tensorflow/), as it is highly optimized for CPUs. How can I do that? If it is not possible, where should I look in the DeepSpeech code to introduce the needed modifications? Since I’m running Linux, environment variables like LD_PRELOAD are available, so could I force the OS to load intel-tensorflow rather than the TensorFlow that DeepSpeech uses?

Some more details:

  • CPU : Intel® Core™ i5-8350U CPU @ 1.70GHz
  • OS: Fedora Linux 29

Since you are running on a CPU, have you tried using the TFLite model?
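In case it helps, switching is mostly a matter of pointing the same Python API at the .tflite file. A minimal sketch, assuming the TFLite build of the Python bindings (published as deepspeech-tflite, if I remember correctly); file names are placeholders:

```python
# Sketch: same deepspeech Python API, but loading the .tflite model instead of .pbmm.
# Assumes `pip install deepspeech-tflite`; model and audio paths are placeholders.
import wave

import numpy as np
import deepspeech

model = deepspeech.Model("deepspeech-0.8.2-models.tflite")   # .tflite instead of .pbmm

with wave.open("command.wav", "rb") as wav:                  # 16 kHz mono 16-bit PCM
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(model.stt(audio))
```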

You would have to rebuild everything, which can be quite a headache. If this is just for yourself, it’s maybe not worth it. If this is going to be a product for a client, maybe it is.

Search this forum, as building for special environments comes up quite a lot. Once you have read some of those threads, your questions will be more specific and it will be easier to answer them. Start by searching for Mozilla’s TensorFlow, as DeepSpeech doesn’t use the vanilla one. You’ll probably have to modify Intel’s TF accordingly.

No, we statically link TensorFlow (and we need a few patches at build time).

If you change the tensorflow git submodule to point to Intel’s fork, you should be able to do it, but you will have to hack.

4 s of inference for how much audio?

@lissyx, the audio is 5 seconds long.


This CPU has AVX2; our binaries are built only with AVX (for broader compatibility reasons).

Depending on the CPU, you can get up to a 40% speedup with AVX2 (we measured that a long time ago, though, so the figure might be different now).

That, plus using the TFLite runtime (please follow the very specific details carefully) with AVX2 and the TFLITE_WITH_RUY_GEMV flag, can improve things even further.

Right, so it’s faster than realtime, which is our goal. Please explain how that is a problem?

From the user’s perspective it would be better to wait less, e.g. 2 s rather than 4 s. So I want to explore options for making inference faster on the CPU.

  1. Can you point me to the location where you statically link TensorFlow?
  2. Could you please explain, or point me to, what the patches you apply to TensorFlow are about?

Have you read the docs on how to rebuild?

native_client/BUILD

Please help yourself and git diff between our git fork and upstream.

It’s hard to help you there without knowing about your implementation.

If you are using the streaming interface, and you have a CPU able to be faster than real time, I don’t see how you can have any waiting time.

This is based on TensorFlow r2.3

Have you tried streaming recognition? Maybe that already solves your performance issue.

Our history should be clean enough on top of r2.3: https://github.com/mozilla/tensorflow/commits/23ad988fcde60fb01f9533e95004bbc4877a9143

@othiele I haven’t tried it yet. I will look into that.

My implementation is (roughly as sketched below):

  1. I record an audio sample (using arecord) and store it in a file (max 5 s)
  2. I use this sample as input to DeepSpeech inference to get a transcription (~4 s)
  3. I perform an action based on the transcription received
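In code, that batch flow boils down to roughly the following (a sketch with placeholder paths, not my exact script):

```python
# Sketch of the batch pipeline: arecord has already written command.wav (max 5 s),
# then one stt() call transcribes the whole clip. Paths are placeholders.
import wave

import numpy as np
import deepspeech

model = deepspeech.Model("deepspeech-0.8.2-models.pbmm")     # placeholder model path

with wave.open("command.wav", "rb") as wav:                  # file written by arecord
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

transcript = model.stt(audio)                                # this is the ~4 s step
print(transcript)
# ...dispatch an action based on `transcript` here.
```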

Right, so as advised, use the streaming interface; you should already get decent performance. If it’s not enough, you can rebuild with AVX2 (but if you need to deploy this, it will limit your target devices). Then later you can look into intel-tensorflow.
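For illustration, here is a minimal sketch of what the streaming variant could look like, feeding arecord’s raw output into the decoder while recording; the model path, chunk size and arecord options are assumptions, not a prescribed setup:

```python
# Sketch: overlap decoding with recording using the streaming API, so most of the
# work is already done when the recording ends. Paths and options are placeholders.
import subprocess

import numpy as np
import deepspeech

model = deepspeech.Model("deepspeech-0.8.2-models.pbmm")     # placeholder model path
stream = model.createStream()

# Record 5 s of raw 16 kHz mono 16-bit PCM with arecord and pipe it straight in.
rec = subprocess.Popen(
    ["arecord", "-q", "-t", "raw", "-f", "S16_LE", "-r", "16000", "-c", "1", "-d", "5"],
    stdout=subprocess.PIPE,
)
while True:
    chunk = rec.stdout.read(3200)                            # ~100 ms of audio
    if not chunk:
        break
    stream.feedAudioContent(np.frombuffer(chunk, dtype=np.int16))

print(stream.finishStream())                                 # only the tail is left to decode
```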

Also, please be aware that I don’t know the extent of their fork; it’s not impossible that they are focusing on improving training / adding new training devices (like Xeon Phi).

@lissyx Thanks for the suggestions.

In the TensorFlow repo (https://github.com/tensorflow/tensorflow) a number of builds are mentioned in TF’s continuous integration. One of them is: Linux CPU with Intel oneAPI Deep Neural Network Library (oneDNN). This is what I’m looking for, and it means that some Intel optimizations are already in this repo. So it may just be a matter of building TF with support for Intel oneDNN, or that support may already be present in the DeepSpeech project. Via the env var DNNL_VERBOSE=1 it is possible to see diagnostic output from Intel oneDNN if it is being used. I do not see it for the DS 0.8.2 release, so it is not used.
So it is probably a matter of adjusting the compilation flags of TensorFlow to benefit from the Intel oneDNN optimizations. In short, there is no Intel TF repo other than https://github.com/tensorflow/tensorflow.
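For reference, the check I ran is roughly the following; DNNL_VERBOSE has to be set before the native library is loaded, i.e. before importing deepspeech (paths are placeholders):

```python
# Sketch: does libdeepspeech ever call into oneDNN? If it does, lines starting
# with "dnnl_verbose" should appear on stdout during inference.
import os
os.environ["DNNL_VERBOSE"] = "1"       # must be set before the native lib loads

import wave

import numpy as np
import deepspeech                      # libdeepspeech is loaded here

model = deepspeech.Model("deepspeech-0.8.2-models.pbmm")     # placeholder path
with wave.open("command.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
model.stt(audio)
```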

  1. Can you point me to the DeepSpeech code where the building of TensorFlow is performed?

I actually have some insight into Intel’s DL frameworks. You are correct in saying that the optimization focus is on server CPUs, but the focus is on instruction sets rather than specific CPU models, and since modern laptops with Intel CPUs support AVX2, it will be beneficial for desktop clients to use inference that includes the Intel CPU optimizations.

The DL frameworks Intel contributes to use the oneDNN library (https://github.com/oneapi-src/oneDNN) underneath. There is a dynamic dispatcher there, i.e. at runtime, based on CPU capabilities, a decision is made about which implementation to run. For example, if AVX2 is present then, for a given algorithm (convolution, inner product, softmax, etc.), the most suitable implementation (AVX2 code written in JIT assembly) is chosen. The JIT implementations are very efficient and exist for architectures supporting SSE4.1 and beyond (SSE4.1, AVX, AVX2, AVX-512, AVX-512 VNNI, …). For older / different CPUs, reference implementations of the algorithms are used (plain C++ code).
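As a quick illustration of what the dispatcher keys on, you can list the instruction sets your CPU advertises on Linux (a sketch; the flag names are the usual /proc/cpuinfo ones):

```python
# Sketch (Linux only): print which SIMD instruction sets the CPU advertises,
# which is what oneDNN's runtime dispatcher uses to pick its JIT kernels.
with open("/proc/cpuinfo") as f:
    flags = set()
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break

for isa in ("sse4_1", "avx", "avx2", "avx512f", "avx512_vnni"):
    print(f"{isa:12s} {'present' if isa in flags else 'absent'}")
```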

I’m not sure I understand your question. The build steps are documented; I linked you to the BUILD file that defines the building of the libraries, including the TensorFlow dependencies, so you already have everything …

@lissyx Thanks, I learned a lot during this discussion. I will analyze the information received and come back with results on whether TF with Intel oneDNN makes DS inference faster on my platform.

You should really first do streaming, then try just enabling AVX2; your TensorFlow rebuild is much more invasive.

It’s not even clear that the most intensive calls of our graph would be optimized. You should really:

  • profile (a coarse first pass is sketched below)
  • identify the bottlenecks
  • ensure there is an actual optimized oneDNN version for those
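A coarse first pass at the first two points can be done from Python just by timing the stages (a sketch with placeholder paths); for operator-level detail, and to see whether those ops even have oneDNN kernels, you would need a native profiler such as perf on libdeepspeech itself:

```python
# Sketch: rough stage-level timing (model load vs. decode). This only shows where
# the wall-clock time goes at the API level, not which graph ops dominate.
import time
import wave

import numpy as np
import deepspeech

t0 = time.perf_counter()
model = deepspeech.Model("deepspeech-0.8.2-models.pbmm")     # placeholder path
t1 = time.perf_counter()

with wave.open("command.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

t2 = time.perf_counter()
model.stt(audio)
t3 = time.perf_counter()

print(f"model load: {t1 - t0:.2f} s, decode: {t3 - t2:.2f} s "
      f"for {len(audio) / 16000:.1f} s of audio")
```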