I’m implementing a sort of “Google Assistant” that works offline, for the purpose of automating tasks on my computer. DeepSpeech is used to recognize simple commands like: “open terminal”, “file an issue in Jira”, etc. The problem is that CPU inference is quite slow (~4 s on my CPU), so I would like to use the Intel flavor of TensorFlow (https://pypi.org/project/intel-tensorflow/), as it is highly optimized for CPUs. How can I do that? If that is not possible, where should I look in the DeepSpeech code to introduce the needed modifications? Since I’m running Linux, env vars like LD_PRELOAD are available, so could I force the OS to load Intel TensorFlow rather than the TensorFlow that DeepSpeech uses?
Since you’re running on a CPU, have you tried using the TFLite model?
You would have to rebuild everything, which can be quite a headache. If this is just for yourself, it’s maybe not worth it. If this is going to be a product for a client, maybe.
Search this forum, as building for special environments comes up quite a lot. If you have read some of those threads, your questions will be more specific and it will be easier to answer them. Start by searching for Mozilla’s TensorFlow, as DeepSpeech doesn’t use the vanilla one. You’ll probably have to modify Intel’s TF accordingly.
lissyx
no, we statically link tensorflow (and we need a few patches at build time)
If you change the tensorflow git submodule to point to Intel’s, you should be able to do it, but you will have to hack.
lissyx
Depending on the CPU, you can get up to a 40% speedup with AVX2 (we measured that a long time ago, though, so the figure might be different now).
That, plus using the TFLite runtime (please follow the very specific details carefully) with AVX2 + the TFLITE_WITH_RUY_GEMV flag, can improve things even further.
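As an aside (not from the thread): before going through an AVX2 rebuild it is worth confirming the CPU actually advertises the instruction sets mentioned above. A minimal Linux-only sketch reading /proc/cpuinfo:

```python
# Check which of the relevant instruction sets the CPU reports (Linux only).
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for isa in ("sse4_1", "avx", "avx2", "fma", "avx512f"):
    print(f"{isa:8} {'yes' if isa in flags else 'no'}")
```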
lissyx
Right, so it’s faster than realtime, which is our goal. Please explain how that is a problem?
I record an audio sample (using arecord) and store it in a file (max 5 s)
I use this sample as input to DeepSpeech inference to get a transcription (~4 s)
I perform an action based on the transcription received (a minimal sketch of this pipeline is shown below)
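For reference, a minimal sketch of that pipeline using the deepspeech Python package; the model, scorer, and WAV file names are placeholders, and the recording is assumed to be 16 kHz, 16-bit mono (e.g. arecord -f S16_LE -r 16000 -c 1):

```python
import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.8.2-models.pbmm")                 # placeholder model path
ds.enableExternalScorer("deepspeech-0.8.2-models.scorer")  # optional, placeholder path

# Read the 16 kHz, 16-bit mono sample produced by arecord
with wave.open("command.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

text = ds.stt(audio)   # the ~4 s step being discussed
print(text)

if text == "open terminal":
    pass  # launch a terminal, file a Jira issue, etc.
```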
lissyx
Right, so as advised, use the streaming interface; you should already get decent performance. If that’s not enough, you can rebuild with AVX2 (but if you need to deploy this, it will limit your target devices). Then, later, you can look into intel-tensorflow.
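A minimal sketch of the streaming interface being suggested here, assuming a hypothetical audio_chunks() generator that yields 16-bit, 16 kHz mono buffers straight from the microphone instead of waiting for a finished 5 s file:

```python
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.8.2-models.pbmm")   # placeholder path

stream = ds.createStream()
for chunk in audio_chunks():                 # hypothetical capture generator (bytes)
    stream.feedAudioContent(np.frombuffer(chunk, dtype=np.int16))
    partial = stream.intermediateDecode()    # transcription so far
    if partial:                              # e.g. stop early once a known command appears
        print("partial:", partial)
text = stream.finishStream()                 # final transcription
print("final:", text)
```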
Also, please be aware I don’t know the extent of their fork; it’s not impossible that they are focusing on improving training / adding new training devices (like Xeon Phi).
In the TensorFlow repo (https://github.com/tensorflow/tensorflow) a number of builds are mentioned in TF’s continuous integration. One of them is: Linux CPU with Intel oneAPI Deep Neural Network Library (oneDNN). This is what I’m looking for, and it means that some Intel optimizations are already in this repo. So it may just be a matter of building TF with support for Intel oneDNN, or it may already be present in the DeepSpeech project. The env var DNNL_VERBOSE=1 makes Intel oneDNN print diagnostic output if it is used; I do not see it for the DS 8.2 release, so it is not used.
It is therefore probably a matter of adjusting the compilation flags of TensorFlow to benefit from the Intel oneDNN optimizations. In short, there is no Intel TF repo other than https://github.com/tensorflow/tensorflow.
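A small sketch of that DNNL_VERBOSE check, run against the deepspeech CLI from the pip package (model and audio paths are placeholders):

```python
import os
import subprocess

# oneDNN prints lines prefixed with "dnnl_verbose" when DNNL_VERBOSE=1 and it is
# actually executing primitives; no such lines means it is not being used.
env = dict(os.environ, DNNL_VERBOSE="1")
result = subprocess.run(
    ["deepspeech", "--model", "deepspeech-0.8.2-models.pbmm", "--audio", "command.wav"],
    env=env, capture_output=True, text=True,
)
print("oneDNN in use:", "dnnl_verbose" in (result.stdout + result.stderr))
```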
Can you point me to the DeepSpeech code where the build of TensorFlow is performed?
I actually have some insight into Intel’s DL frameworks. You are correct that the optimization focus is on server CPUs, but the focus is on instruction sets rather than specific CPU models, and since modern laptops with Intel CPUs support AVX2, it will also be beneficial for desktop clients to use inference that includes the Intel CPU optimizations.
The DL frameworks that Intel contributes to use the oneDNN library (https://github.com/oneapi-src/oneDNN) underneath. It contains a dynamic dispatcher: at runtime, based on the CPU’s capabilities, a decision is made about which implementation to run. For example, if AVX2 is present, then for a given algorithm (convolution, inner product, softmax, etc.) the most suitable implementation (AVX2 code written in JIT assembly) is chosen. The JIT implementations are very efficient and exist for architectures supporting SSE4.1 and beyond (SSE4.1, AVX, AVX2, AVX512, AVX512-VNNI, …). For older/different CPUs, reference implementations of the algorithms are used (plain C++ code).
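Not oneDNN code, just a toy sketch of the dispatch pattern described above: given the CPU’s feature flags, the most capable available kernel is selected, with a plain fallback otherwise (all names here are made up for illustration).

```python
# Toy illustration of ISA-based runtime dispatch (in oneDNN the per-ISA
# kernels are JIT-generated assembly; here they are stand-in lambdas).
def pick_impl(cpu_flags, implementations):
    for isa in ("avx512f", "avx2", "avx", "sse4_1"):   # most capable first
        if isa in cpu_flags and isa in implementations:
            return implementations[isa]
    return implementations["reference"]                # plain portable fallback

softmax_impls = {
    "avx2": lambda x: x,        # stand-in for an AVX2 JIT kernel
    "reference": lambda x: x,   # stand-in for the reference C++ kernel
}
softmax = pick_impl({"sse4_1", "avx", "avx2", "fma"}, softmax_impls)
```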
lissyx
I’m not sure I understand your questions. The build steps are documented; I linked you to the BUILD file that defines the building of the libraries, so it includes the tensorflow dependencies. You already have everything …
@lissyx Thanks, I learned a lot during this discussion. I will analyze the received info and come back with whether TF with Intel oneDNN makes DS inference faster on my platform.
lissyx
you should really do streaming first, then try just enabling AVX2; your tensorflow rebuild is much more invasive
lissyx
it’s not even clear that the most intensive calls of our graph will be optimized, you should really:
profile
identify the bottlenecks
ensure there is an actual optimized oneDNN version for those (a rough timing sketch is shown below)
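A rough first-pass sketch of that profiling advice using the Python package: time model load separately from decoding, since both can contribute to the ~4 s figure (paths are placeholders):

```python
import time
import wave
import numpy as np
from deepspeech import Model

t0 = time.perf_counter()
ds = Model("deepspeech-0.8.2-models.pbmm")   # placeholder path
t1 = time.perf_counter()

with wave.open("command.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

t2 = time.perf_counter()
text = ds.stt(audio)
t3 = time.perf_counter()

print(f"model load: {t1 - t0:.2f} s")
print(f"inference : {t3 - t2:.2f} s for {len(audio) / 16000:.2f} s of audio")
```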