On the other hand, reading this article from 23 January 2020: https://www.hackster.io/dmitrywat/offline-speech-recognition-on-raspberry-pi-4-with-respeaker-c537e7, the author writes the following when comparing inference times on the Jetson Nano vs. the Raspberry Pi 4, as the latter has a faster CPU:
“There are no pre-built binaries for arm64 architecture with GPU support as of this moment, so we cannot take advantage of Nvidia Jetson Nano’s GPU for inference acceleration. I don’t think this task is on DeepSpeech team roadmap, so in the near future I’ll do some research here myself and will try to compile that binary to see what speed gains can be achieved from using GPU.”
So I'm a bit confused about whether or not DeepSpeech is able to use the GPU for inference on the Jetson Nano. I seem to recall answers in forum posts here suggesting that the goal of DeepSpeech is to optimise for inference on CPUs anyway.
lissyx
Jetson Nano supports CUDA, and we support CUDA. It's just that we:
- are a small team
- have limited CI capacities
- have limited use cases for GPU on those boards
- don't have those boards
So it should be possible to build on ARM64 with CUDA enabled; we just don't provide prebuilt binaries.
I've repeated it myself several times: people who would like to contribute support for that are welcome. Cross-compiling is a bit painful with Bazel, but we document it and have it working on ARMv7 as well as ARM64, so it's definitely possible.
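To make that concrete, here is a rough, untested sketch of what such a build might look like. It assumes a mozilla/tensorflow r1.15 checkout with the DeepSpeech native_client linked in, the existing build:rpi3-armv8 ARM64 cross-toolchain config mentioned below, and a sysroot that already contains the aarch64 CUDA libraries; it is not a documented recipe.

```bash
# Rough sketch only. The CUDA version values are examples for a JetPack 4.x
# image; pointing the CUDA toolchain at the aarch64 libraries inside the
# sysroot is exactly the part that is neither prebuilt nor documented today.
cd tensorflow                      # mozilla/tensorflow, r1.15 branch

export TF_NEED_CUDA=1              # answer ./configure's CUDA questions
export TF_CUDA_VERSION=10.0
export TF_CUDNN_VERSION=7
./configure

# Combine the ARM64 cross-compile config with TensorFlow's CUDA config.
bazel build --config=monolithic --config=rpi3-armv8 --config=cuda \
    -c opt //native_client:libdeepspeech.so
```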
Ok
Appreciate the answer
I must admit, I don’t have any experience with cross-compiling, but would probably like to give it a go at some point
lissyx
It should not be super-complicated. We already have a GCC ARM64 toolchain, so you can re-use that. Then I think it's just a matter of properly setting up a sysroot tree (as we document, using multistrap) that includes CUDA, and getting inspiration from the current build:rpi3-armv8 entry in .bazelrc in TensorFlow: tensorflow/.bazelrc at r1.15 · mozilla/tensorflow · GitHub
It might not be 100% straightforward, because I think most people are afraid of doing that, but if you read the doc carefully and ask precise questions when needed, it's 100% doable.
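As a starting point, here is a minimal sketch of what a CUDA-aware sysroot could look like. The multistrap config and package list are illustrative (not the ones shipped in the repo), and the CUDA path is an assumption about where JetPack installs it on the Nano.

```bash
# Build a Debian buster ARM64 sysroot with multistrap, then copy the Jetson's
# CUDA installation into it so the cross-compiler and Bazel can find it.
cat > multistrap-arm64.conf <<'EOF'
[General]
arch=arm64
directory=sysroot-arm64
cleanup=true
noauth=true
unpack=true
bootstrap=Buster
aptsources=Buster

[Buster]
# Example package set; the real list depends on what the build actually needs.
packages=libc6-dev libstdc++-8-dev libpython3.7-dev
source=http://deb.debian.org/debian
suite=buster
EOF

multistrap -a arm64 -d sysroot-arm64 -f multistrap-arm64.conf

# Pull CUDA from a running Nano (JetPack puts it under /usr/local/cuda-10.0;
# adjust to the version actually installed on the board).
rsync -a nano:/usr/local/cuda-10.0/ sysroot-arm64/usr/local/cuda-10.0/
```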
I am the “author” of the Jetson Nano build mentioned by @chrillemanden in the initial post.
I found the performance of the Jetson Nano with GPU a bit underwhelming for DeepSpeech inference. There is still a lot of computation done on the CPU, and the copying of data between CPU and GPU memory probably adds too much overhead.
Maybe I also missed some optimization flags for the compiler.
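One quick, non-scientific way to see how much the GPU is actually doing during inference is to watch tegrastats (which ships with JetPack) while the standard client runs; the model and audio paths below are placeholders.

```bash
# GR3D_FREQ in the tegrastats output reflects GPU load on the Nano.
sudo tegrastats &
deepspeech --model deepspeech-0.6.0-models/output_graph.pbmm --audio test.wav
sudo pkill tegrastats
```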
lissyx
Cool! Any reason not to rely on cross-compilation?
Our graph has some ops that don't have a GPU implementation anyway, so it's not surprising. Have you had a chance to compare with a desktop GPU and see if there's a real difference in which ops get executed on the GPU?
Well, maybe it'd be worth adapting the --copt= flags to match your precise ARM core, but I have never seen this have a real influence in our context.
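For example (purely illustrative: these are standard GCC flags for the Nano's Cortex-A57 cores, not values taken from our build files):

```bash
bazel build --config=monolithic --config=rpi3-armv8 -c opt \
    --copt=-march=armv8-a+crc --copt=-mtune=cortex-a57 \
    //native_client:libdeepspeech.so
```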
Yes, my laziness. I used to develop Java and Python web applications in my day job about 10 years ago, so I rarely had to deal with all that C/C++ build-chain stuff, and Bazel makes my head explode…
What I understand from the build process for TensorFlow and DeepSpeech is that support for the AArch64/ARM64 architecture is often tied to the Android OS; maybe it would help to untangle this?
My desktop at home is an (older) Mac, so unfortunately no CUDA/GPU.
lissyx
Understandable
Well, we cross-build for Linux/ARMv7 and Linux/AArch64, with a few changes to TensorFlow to add the cross-compiler, and a multistrap-generated sysroot.
The Python bindings were a bit hacky: you have to manually build SWIG 3.0.2 and create/replace the symlinks in DeepSpeech/native_client/ds-swig. I didn't manage to build the C++ bindings, but I don't need them anyway…
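Roughly what that manual step looked like; the install prefix and the exact symlink layout under native_client/ds-swig are assumptions based on my setup, not something taken from the DeepSpeech build scripts.

```bash
# Build SWIG 3.0.2 from source (fetch the swig-3.0.2 release tarball first).
sudo apt-get install -y libpcre3-dev      # SWIG's configure wants PCRE
tar xzf swig-3.0.2.tar.gz
cd swig-3.0.2
./configure --prefix="$HOME/swig-3.0.2-install"
make -j"$(nproc)"
make install

# Point the DeepSpeech build at the freshly built SWIG.
cd /path/to/DeepSpeech/native_client
mkdir -p ds-swig/bin
ln -sfn "$HOME/swig-3.0.2-install/bin/swig" ds-swig/bin/swig
ln -sfn "$HOME/swig-3.0.2-install/share"    ds-swig/share
```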
lissyx
Right, though you should use the same SWIG version "just in case". But it's true we don't have prebuilt versions for ARM64, so you need to build your own.
It’s good to know it can work in-place. Would you like to help get that working through cross-compilation, if doable?
Editing Makefiles and hacking build processes ranks next to visiting the dentist for me.
I will think about it and come back to you with a PR in case I find a painless solution…