GPU Inference on Jetson Nano

I'm currently working on a bachelor's thesis in which we are looking to deploy DeepSpeech on an NVIDIA Jetson Nano. We followed this guide to build DeepSpeech 0.6.0: The author says he successfully built DeepSpeech with CUDA support for the Jetson Nano.

On the other hand, reading this article from the 23rd of January 2020: the author writes the following when comparing inference times on the Jetson Nano vs. the Raspberry Pi 4 (the latter has a faster CPU):
“There are no pre-built binaries for arm64 architecture with GPU support as of this moment, so we cannot take advantage of Nvidia Jetson Nano’s GPU for inference acceleration. I don’t think this task is on DeepSpeech team roadmap, so in the near future I’ll do some research here myself and will try to compile that binary to see what speed gains can be achieved from using GPU.”

So I'm a bit confused about whether or not DeepSpeech is able to use the GPU for inference on the Jetson Nano. I seem to recall answers in forum posts here suggesting that DeepSpeech's goal is to optimise for inference on CPUs anyway.

Jetson Nano supports CUDA, we support CUDA. It's just that we:

  • are a small team
  • have limited CI capacities
  • have limited use cases for GPU on those boards
  • don’t have those boards

So it should be possible to build on ARM64 with CUDA enabled, we just don’t provide prebuilt binaries.

I've said it several times: people who would like to contribute support for that are welcome. Cross-compiling is a bit painful with Bazel, but we document it and have it working on ARMv7 as well as ARM64, so it's definitely possible.

Appreciate the answer
I must admit, I don’t have any experience with cross-compiling, but would probably like to give it a go at some point

It should not be super-complicated. We already have a GCC ARM64 toolchain, so you can re-use that. Then I think it's just a matter of properly setting up a sysroot tree (as we document, using multistrap) that includes CUDA; get inspiration from the current build:rpi3-armv8 config in .bazelrc in tensorflow:
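As a rough illustration of what the multistrap step looks like, a minimal config for an ARM64 sysroot might resemble the one below. The package list and the Debian suite here are my assumptions, not the project's documented setup, and the CUDA libraries themselves would typically have to be copied into the sysroot from the JetPack/L4T packages rather than fetched from the Debian archive:

```ini
; hypothetical multistrap config for an arm64 sysroot (illustrative only)
[General]
arch=arm64
directory=multistrap-arm64-sysroot
noauth=true
unpack=true
bootstrap=Debian

[Debian]
; headers/libs the cross-compiler needs; exact list is an assumption
packages=libc6-dev libstdc++-6-dev linux-libc-dev
source=http://deb.debian.org/debian
suite=buster
```

You would then run `multistrap -f <config>` and point the toolchain's sysroot at the resulting directory, merging in the CUDA tree from the Jetson's own packages.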

It might not be 100% straightforward, because I think most people are afraid of doing that, but if you read the doc carefully and ask precise questions when needed, it's 100% doable.


Ok, I will let you know if I ever get anything working


I am the “author” of the Jetson Nano build mentioned by @chrillemanden in the initial post.

I found the performance of the Jetson Nano with GPU a bit underwhelming for DeepSpeech inference. There is still a lot of computation done on the CPU, and copying data between CPU and GPU memory probably adds too much overhead.
Maybe I missed some compiler optimization flags, too.


Cool! Any reason not to rely on cross-compilation?

Our graph has some ops that don't have a GPU implementation anyway, so it's not surprising. Have you had a chance to compare with a desktop GPU and see if there's a real difference in which ops get executed on the GPU?

Well, maybe it'd be worth adapting the --copt= flags to match your precise ARM core, but I have never seen this have a real influence in our context.

Yes, my laziness :slight_smile: I used to develop Java and Python web applications in my day job about 10 years ago, and I rarely had to deal with all that C/C++ build-chain stuff - and Bazel makes my head explode…

What I understand from the build process for TensorFlow and DeepSpeech is that support for the AArch64/ARM64 architecture is often tied to the Android OS - maybe untangling this would help?

My desktop at home is an (older) Mac - so unfortunately no CUDA/GPU.


Well, we cross-build for Linux/ARMv7 and Linux/AArch64, with a few changes to TensorFlow to add a cross-compiler, and a multistrap-generated sysroot.

I have built DeepSpeech 0.8.2 with CUDA support for Nvidia Jetson/Xavier - any feedback welcome…


Do you have patches to share for that?

No patches. Built with:

```shell
bazel build \
  --workspace_status_command="bash native_client/" \
  --config=monolithic \
  --copt=-march=armv8-a \
  --copt=-mtune=cortex-a57 \
  --copt=-O3 \
  --copt="-D_GLIBCXX_USE_CXX11_ABI=0" \
  --copt=-fvisibility=hidden \
  --copt=-fPIC \
  --config=cuda \
  --config=nonccl \
  --config=noaws \
  --config=nogcp \
  --config=nohdfs \
  --config=v2 \
  --verbose_failures \
  // //native_client:generate_scorer_package
```

The Python bindings were a bit hacky: you have to manually build SWIG 3.0.2 and create/replace the symlinks in DeepSpeech/native_client/ds-swig. I didn't manage to build the C++ bindings, but I don't need them anyway…
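For anyone trying to reproduce this, a sketch of that SWIG workaround might look like the following. The version tag, install prefix, and checkout layout are my assumptions based on the post, not documented steps:

```shell
# Sketch of the SWIG workaround described above; paths/tags are assumptions.

# 1. Build SWIG 3.0.2 into a local prefix (run once, needs autotools):
#    git clone --branch rel-3.0.2 https://github.com/swig/swig.git
#    cd swig && ./autogen.sh && ./configure --prefix="$HOME/swig-3.0.2"
#    make && make install

# 2. Replace the ds-swig symlink in the DeepSpeech checkout so the build
#    picks up the locally built SWIG instead of the missing prebuilt one:
SWIG_PREFIX="${SWIG_PREFIX:-$HOME/swig-3.0.2}"
DS_SWIG="${DS_SWIG:-DeepSpeech/native_client/ds-swig}"
mkdir -p "$(dirname "$DS_SWIG")"    # ensure the checkout path exists in this sketch
ln -sfn "$SWIG_PREFIX" "$DS_SWIG"   # -n replaces an existing symlink in place
echo "ds-swig -> $(readlink "$DS_SWIG")"
```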

Right, you should though use the same SWIG version “just in case”, but it’s true we don’t have prebuilt versions for ARM64, so you need to build your own

It’s good to know it can work in-place. Would you like to help get that working through cross-compilation, if doable?

Editing Makefiles and hacking build processes comes next to visiting the dentist for me :crazy_face:
I will think about it and come back to you with a PR in case i find a painless solution…
