ARM native_client with GPU support


I am attempting to build a version of deepspeech-gpu bindings and the native_client for ARMv8 with GPU support. The target platform is NVIDIA’s Jetson-class embedded systems – the TX-1/2 in particular, but I have access to a PX2 as well.

These systems run Ubuntu 16.04 LTS for aarch64, with CUDA 8.0, cuDNN 6, and compute capability 5.2.

I have the DeepSpeech repo as of commit e5757d21a38d40923c1de9b86597685f365150ee, the Mozilla fork of TensorFlow as of commit 08894f64fc67b7a8031fc68cb838a27009c3e6e6, and Bazel 0.5.4. My Python version is 3.5.2.

I have added the --config=cuda option to the suggested build command. Here’s the session output:

ubuntu@nvidia:~/Source/deepspeech/tensorflow$ bazel build -c opt --config=cuda --copt=-O3 // // //native_client:deepspeech //native_client:deepspeech_utils // //native_client:generate_trie
[547 / 671] Compiling native_client/kenlm/util/double-conversion/bignum-dtoa.cc
ERROR: /home/ubuntu/Source/deepspeech/tensorflow/native_client/BUILD:48:1: C++ compilation of rule '//native_client:deepspeech' failed (Exit 1).
In file included from native_client/kenlm/util/double-conversion/bignum-dtoa.h:31:0, from native_client/kenlm/util/double-conversion/
native_client/kenlm/util/double-conversion/utils.h:71:2: error: #error Target architecture was not detected as supported by Double-Conversion.
 #error Target architecture was not detected as supported by Double-Conversion.

What is a more appropriate list of build targets to give Bazel? I’m willing to go without the language model for now if I have to – the raw output from the NN is good enough for my purposes right now.

(Lissyx) #2

Thanks for testing this! I know that @elpimous_robot succeeded with this setup, and he had to add a small patch on top of KenLM. As far as I can tell, he was in the process of submitting this patch upstream.

(Vincent Foucault) #3

Open native_client/kenlm/util/double-conversion/utils.h (adjust the path to your checkout):

Add this to the architecture check: defined(__aarch64__) ||

// On Linux,x86 89255e-22 != Div_double(89255.0/1e22)
#if defined(_M_X64) || defined(__x86_64__) || \
    defined(__ARMEL__) || defined(__avr32__) || \
    defined(__hppa__) || defined(__ia64__) || \
    defined(__mips__) || defined(__powerpc__) || \
    defined(__sparc__) || defined(__sparc) || defined(__s390__) || \
    defined(__SH4__) || defined(__alpha__) || defined(__aarch64__)
#elif defined(_M_IX86) || defined(__i386__) || defined(__i386)
#if defined(_WIN32)

That’s all.


Okay, that really helped a lot.

I can make a wheel – @lissyx, are all wheels named deepspeech-0.1.0-..., or must I do something else to get deepspeech-gpu?

(Lissyx) #5

If you want to make wheels available, you should take a look there; it is documented :slight_smile:
But at some point, if you can, it would be better to just work on adding ARMv8 cross-compilation; that would benefit everybody.

(Lissyx) #6

More precisely @gvoysey, it is there:


@lissyx I’ll update in ~18 hours with my progress.

I had thought that the tf build_pip_package tool was just for building wheels of TensorFlow itself, so I’ll investigate further.

(Lissyx) #8

Oh, right. Sorry, I misread. It’s handled there, and for GPU builds you need to pass --project_name deepspeech-gpu :slight_smile:
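
For reference, wheel filenames follow the PEP 427 convention `{distribution}-{version}-{python tag}-{abi tag}-{platform tag}.whl`, with `-` in the project name normalized to `_`. A sketch of what the renamed wheel would be called (version and tags here are illustrative, not what your build will necessarily produce):

```shell
# Illustrative values; the real version/tags come from setup.py and the build env.
proj="deepspeech-gpu"; ver="0.1.0"; py="cp35"; abi="cp35m"; plat="linux_aarch64"
# '-' in the distribution name becomes '_' in the wheel filename
echo "${proj//-/_}-${ver}-${py}-${abi}-${plat}.whl"
# → deepspeech_gpu-0.1.0-cp35-cp35m-linux_aarch64.whl
```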


I’m back. I think I’m close, but I’m running into a last few troublesome things. A bunch of stuff changed and improved, so I’ve readjusted my build steps and, hopefully, documented them well below. Very long error logs are included as links to a pastebin.

@lissyx @elpimous_robot – does anything jump out at you in what’s below?

Goal: compile deepspeech native_client for ARMv8 (aarch64) with GPU support


All work was performed on an NVIDIA TX-1 running JetPack 3.1. The
kernel was recompiled to support swap files, and an 8GB swap file was enabled.

Prep Work

First, just set up the repos.

Clone Mozilla’s DeepSpeech and TensorFlow repositories at the right versions:

mkdir $HOME/deepspeech
cd $HOME/deepspeech
git clone
git clone
cd $HOME
ln -s deepspeech/tensorflow ./
ln -s deepspeech/DeepSpeech ./


I adjusted the CUDA paths to match what’s true on the TX-1; diff below.

git diff
diff --git a/ b/
index dec1ad7..f372dc4 100755
--- a/
+++ b/
@@ -95,7 +95,9 @@ if [ "${OS}" = "Darwin" ]; then

 ### Define build parameters/env variables that we will re-use in sourcing scripts.


Update tc-build to add a new option for building tensorflow natively on
ARMv8 with CUDA support (using the vars set in

git diff
diff --git a/ b/
index 31c4d69..a7d432e 100755
--- a/
+++ b/
@@ -11,14 +11,18 @@ if [ "$1" = "--gpu" ]; then

-if [ "$1" = "--arm" ]; then
-    build_gpu=no
+if [ "$2" = "--arm" ]; then

-pushd ${DS_ROOT_TASK}/DeepSpeech/tf/
+pushd ${DS_ROOT_TASK}/tensorflow
     BAZEL_BUILD="bazel ${BAZEL_OUTPUT_USER_ROOT} build -s --explain bazel_monolithic_tf.log --verbose_explanations --experimental_strict_action_env --config=monolithic"

+    # experimental aarch64 GPU build (NVIDIA Jetson-class devices)
+    if [ "${build_gpu}" = "yes" -a "${build_arm}" = "yes" ]; then
+    fi
     # Pure amd64 CPU-only build

Build tensorflow 1.4

By running the build script with --gpu --arm, we obtain this tree
(very long paste), which contains the build targets we specified. In
particular, libdeepspeech.so, libdeepspeech_utils.so, etc. are all
built and of reasonable sizes.

Attempt to build native-client

Next, I adapted the flow from taskcluster (as suggested earlier)
and created $HOME/deepspeech/DeepSpeech/taskcluster/


set -xe

source $(dirname "$0")/../

source ${DS_ROOT_TASK}/tensorflow/


EXTRA_LOCAL_LDFLAGS="-L/usr/local/cuda/targets/aarch64-linux/lib/ -L/usr/local/cuda/targets/aarch64-linux/lib/stubs -lcudart -lcuda"


  # unset PYTHONPATH
  # export PYENV_ROOT="${DS_ROOT_TASK}/DeepSpeech/.pyenv"
  # export PATH="${PYENV_ROOT}/bin:$PATH"

  # install_pyenv "${PYENV_ROOT}"
  # install_pyenv_virtualenv "$(pyenv root)/plugins/pyenv-virtualenv"

  mkdir -p wheels

  if [ "${rename_to_gpu}" ]; then
    SETUP_FLAGS="--project_name deepspeech-gpu"
  fi

  # for pyver in ${SUPPORTED_PYTHON_VERSIONS}; do
  #   pyenv install ${pyver}
  #   pyenv virtualenv ${pyver} deepspeech
  #   source ${PYENV_ROOT}/versions/${pyver}/envs/deepspeech/bin/activate

#     RASPBIAN=/tmp/multistrap-raspbian-jessie
    make -C native_client/ \
      TFDIR=${DS_TFDIR} \
      SETUP_FLAGS="${SETUP_FLAGS}" \
      bindings-clean bindings

    cp native_client/dist/*.whl wheels

    make -C native_client/ bindings-clean

    # deactivate
 #   pyenv uninstall --force deepspeech
 # done;

do_deepspeech_python_build rename_to_gpu

#do_deepspeech_nodejs_build rename_to_gpu

$(dirname "$0")/
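
Since EXTRA_LOCAL_LDFLAGS above hard-codes the Jetson CUDA layout, it is worth checking that those directories actually exist before the link step. A small guard (the paths are the Jetson defaults and will be reported missing on other machines):

```shell
# Report whether each CUDA link directory used in EXTRA_LOCAL_LDFLAGS exists.
for d in /usr/local/cuda/targets/aarch64-linux/lib \
         /usr/local/cuda/targets/aarch64-linux/lib/stubs; do
  if [ -d "$d" ]; then echo "found:   $d"; else echo "missing: $d"; fi
done
```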

Running this failed quickly, with ld failing to find Model::Model in libdeepspeech.so.

With nm, we can inspect libdeepspeech.so and see that the
symbols are indeed missing. They are present, however, in libdeepspeech.a:

ubuntu@nvidia:~/tensorflow/bazel-bin/native_client$ nm -gC libdeepspeech.so | grep Model::Model
ubuntu@nvidia:~/tensorflow/bazel-bin/native_client$ nm -gC libdeepspeech.a | grep Model::Model
0000000000000000 T DeepSpeech::Model::Model(char const*, int, int, char const*, int)
0000000000000000 T DeepSpeech::Model::Model(char const*, int, int, char const*, int)

We move past this point with some trepidation, but changing the linker flags in
DeepSpeech/native_client/ as follows yielded some progress:

ubuntu@nvidia:~/deepspeech/DeepSpeech/native_client$ git diff 52adc2b2ddfb70eebfea84ada44f74af29336f2b:native_client/
diff --git a/native_client/ b/native_client/
index 32a4d80..622e88d 100644
--- a/native_client/
+++ b/native_client/
@@ -48,8 +48,8 @@ LDFLAGS_RPATH  := -Wl,-rpath,@executable_path

-LIBS    := -ldeepspeech -ldeepspeech_utils $(EXTRA_LIBS)
-LDFLAGS_DIRS := -L${TFDIR}/bazel-bin/native_client $(EXTRA_LDFLAGS)
+LIBS    := -ltensorflow_so -l:libdeepspeech.a -l:libdeepspeech_utils.a $(EXTRA_LIBS)
+LDFLAGS_DIRS := -L${TFDIR}/bazel-bin/tensorflow -L${TFDIR}/bazel-bin/native_client $(EXTRA_LDFLAGS)

Now we see the symbols in libdeepspeech.a.

However, rerunning the native client build now yields a very long
error log, which seems to make it past the missing Model::Model,
only to fail to find symbols in another library in a very similar fashion.

At this point I started to worry that my bazel build step was
totally wrong in some way that broke the linker.


  1. How can I make *.so files only, and skip making .a entirely?
  2. What could cause symbols to be stripped from the .so in this way?
  3. How close am I to home base?

(Lissyx) #10

I’m sorry, you should move to r1.5 now; we build monolithic, all in one :). There have been so many changes relevant to your questions that it’s hard to answer them.

(Lissyx) #11

I’d like to add that, in the end, if you just follow the build instructions in native_client/, it should be straightforward: @elpimous_robot successfully updated his codebase the other day on current r1.5+master without any issue.

Unless you want to get ARMv8 cross-compilation to work (which would be cool), there’s little to no value in hacking directly on the taskcluster code :slight_smile:


@lissyx @elpimous_robot @gvoysey Thanks for the clear explanation. I have a few questions to post. I see that you were using the TensorFlow repository from the Mozilla group. Isn’t the normal TensorFlow installation, using pip from the official website, fine? At least, can I use the installation from JetsonHacks? Because at a later stage, I would like to run DeepSpeech and object detection from the models.

@lissyx As per your recent post, if I modify the code that elpimous_robot pointed out and build it, will I be able to generate a Python package that can run DeepSpeech on the Jetson GPU?

(Lissyx) #13

You should have nothing to modify. If you are referring to KenLM, yes, you might need to do that, but it’s unrelated to the Python package.

What exactly do you want to run on your jetson? Training? Only inference?


@saikishor I have been proceeding under these assumptions:

  1. The Mozilla fork of TensorFlow contains things that DeepSpeech needs but Google’s TF doesn’t provide:
  • including ARM-specific code
  • including CTC-related features
  2. The TensorFlow pip package is a precompiled binary, not something you can use to build DeepSpeech against. No precompiled packages are available for ARMv8/aarch64 + CUDA, hence my attempts to build everything myself.
  3. Once native_client is actually built, I can deploy it with its libraries to arbitrary TX1s without having to repeat (1) and (2).

@lissyx I’m not who you asked, but my goal is to run inference on a TX1 with CUDA. I have much nicer x86_64-based systems for training, thankfully with everything precompiled :slight_smile:

@lissyx I’m moving forward with the Mozilla TF r1.5 process – stay tuned :slight_smile: Previous attempts to follow native_client/readme.rst have resulted in builds that succeed but seem totally ignorant of the existence of a GPU (and thus have ~15 second inference times on a ~1 s utterance, which is no bueno).
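
A fast way to tell whether a finished binary can even see the GPU is to check its dynamic dependencies: a CUDA-enabled build links against libcuda/libcudart. The sketch below uses /bin/sh as a stand-in for the real deepspeech binary, which is an assumption about where your build landed:

```shell
# List dynamic dependencies and look for CUDA libraries.
# Replace /bin/sh with the actual deepspeech binary on the device.
ldd /bin/sh | grep -iE 'cuda|cudnn' || echo "no CUDA libraries linked"
# → no CUDA libraries linked   (for /bin/sh; a GPU build should list libcudart etc.)
```

If this prints "no CUDA libraries linked" for your deepspeech binary, the build was CPU-only, matching the ~15 s inference times above.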


Thanks for the prompt reply. I only want to do inference on a Jetson TX2. In this case, can I install deepspeech normally and proceed, or are there extra steps to perform, as above?

(Lissyx) #16

If you are only running inference, there’s no need for the Python wheel package. Regarding your build that could not use the GPU, I would need more information about what you did; I cannot help without knowing.

(Lissyx) #17

Two solutions: set up ARMv6 multilib on your system and use the prebuilt binaries (CPU only), or build for ARMv8. In either case, if you are only doing inference, you don’t need the Python wheel.


Cool!!! Will inference use the GPU if I install deepspeech-gpu on the Jetson TX2?

You mean as @gvoysey explained earlier?



@lissyx I am only doing inference, but I’m using the Python bindings as part of a larger framework – so I do care about the wheel (or I am deeply confused).

Regarding my no-GPU build – I am throwing it out and bumping to Mozilla TF 1.5, on which I will take very detailed notes indeed.

(Lissyx) #20

I mean as documented in native_client/. His instructions are just slight variations; there is really nothing different.