ARM native_client with GPU support

(Lissyx) #21

You are deeply confused. You don’t need to build the tensorflow pip package to build the deepspeech Python wheel :). Make sure you are using the proper build flags when you build: if you are targeting CUDA, any bazel build statement should have --config=cuda.


ah ha! i misparsed what you were saying, yeah.

i know i don’t need the TF wheels to build the deepspeech-gpu wheel. I know i do need the deepspeech-gpu wheel.

(Lissyx) #23

Perfect, your wording was kind of unclear to me, I don’t want unclear instructions :slight_smile:

(Sai Kishor Kothakota) #24

@gvoysey I am a bit confused at this moment.
If I understand what you said, you mean that if I install tensorflow-gpu from pip and then install deepspeech-gpu, I will run into trouble with it, but as per @lissyx, I won’t have issues using both of them together.


@saikishor i’ll defer to @lissyx here, but my point is simply that I do not believe that pip install deepspeech-gpu will work on an nvidia TX-class machine. (i’m pretty sure that pip install tensorflow-gpu won’t work on a TX either, but it’s moot)

If i’m wrong, i’ll be very happy, but I don’t think i am.

(Lissyx) #26

It would work if there were ARMv8 binaries for the system. We don’t provide any for DeepSpeech, because this would require setting up ARMv8 cross-compilation and while it’s do-able, we have more important things to focus on. You can do pip install <deepspeech.whl> once you have built on your ARMv8 system the Python package, though :).

(Sai Kishor Kothakota) #27

@gvoysey Yes, I also believe that pip install tensorflow-gpu doesn’t work, so I am going to follow the approach explained by Jetsonhacks. The Jetson TX-2 is an ARMv8 system, and as per what @lissyx mentioned just now:

I guess pip install <deepspeech.whl> should work, as lissyx says. I guess the TX-1 is equipped with ARMv6, so it would need this setup, I suppose. What do you think?

(Lissyx) #28

Why do you keep wanting to install the tensorflow-gpu package? If you are only running inference, you don’t need that. You might want deepspeech-gpu, but again, read what I said above on that: we only provide for ARMv6, so no GPU. Follow native_client/ to build.


@saikishor the tx-1 has Quad ARM® A57/2 MB L2; which are v8.
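
If in doubt about which ARM generation a board runs, asking the kernel settles it. A minimal sketch, assuming a Linux userland where `uname -m` prints `aarch64` on 64-bit ARMv8 boards and `armv6l`/`armv7l` on the 32-bit ones; `arm_generation` is a made-up helper name, not part of DeepSpeech:

```shell
#!/bin/sh
# Map a machine string (as printed by `uname -m`) to an ARM generation.
# arm_generation is a hypothetical helper used only for illustration.
arm_generation() {
    case "$1" in
        aarch64|arm64|armv8*) echo "ARMv8" ;;
        armv7*)               echo "ARMv7" ;;
        armv6*)               echo "ARMv6" ;;
        *)                    echo "not ARM ($1)" ;;
    esac
}

# Report the generation of the machine this runs on.
arm_generation "$(uname -m)"
```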


@lissyx do you recommend starting from DeepSpeech latest commit, or the v0.1.1 tag?

(Lissyx) #31

You should use master, that’s where all the fun is :). It’s bringing a lot of improvements as well …


got it. but pegged to, right?

(Sai Kishor Kothakota) #33

@lissyx Thank you, that was helpful. So now I should build for ARMv8 following the guidelines in native_client/, with the GPU hacks presented by @gvoysey and @elpimous_robot, get the .whl built, and install it using pip.

(Lissyx) #34

Yes, TensorFlow master has some slight differences that will make the bazel build choke on some definitions we have in native_client/BUILD. Don’t forget --config=cuda, if you need CUDA. We are trying to improve that, now the should point to the proper matching branch, with currently DeepSpeech/master being tied to tensorflow/r1.5 and DeepSpeech/tf-master being tied to tensorflow/master, in case you are curious / want to hack on more recent codebase.

(Sai Kishor Kothakota) #35

I guess there is a lot of work to do :smile: @gvoysey

(Sai Kishor Kothakota) #36

@gvoysey are you going to start from scratch for TensorFlow 1.5? I would also like to know one more thing: are you developing in Docker containers or directly on your TX machine?

(Vincent Foucault) #37

hi @saikishor,

use mozilla/deepspeech/master1.5, mozilla/deepspeech/master
install requirements.txt, follow native_client/,
remember to add --config=cuda (to use tx1/2 gpu0)

and you should have a nice deepspeech working on our fabulous nvidia boards!


Notes and steps for compiling natively on a Jetson machine. Goal: get
CUDA-enabled native_client==v0.1.1.

Repo setup

We need to compile mozilla’s tf 1.5 fork as well as the native_client
package provided as part of mozilla DeepSpeech.

cd $HOME/deepspeech  #project root
git clone
git clone
#master breaks bazel.
cd tensorflow && git checkout r1.5
#put a symlink to native client
cd ../DeepSpeech
ln -s ../DeepSpeech/native_client ../tensorflow/
cd $HOME
ln -s deepspeech/DeepSpeech ./
ln -s deepspeech/tensorflow ./
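
Before building, it is worth confirming that the symlinks actually resolve, since a relative link that points nowhere only shows up later as a confusing bazel error. A small sketch; `check_link` is a hypothetical helper and the paths assume the layout used in the steps above:

```shell
#!/bin/sh
# check_link PATH: succeed only if PATH is a symlink whose target exists.
check_link() {
    if [ -L "$1" ] && [ -e "$1" ]; then
        echo "ok: $1 -> $(readlink "$1")"
    else
        echo "broken or missing: $1" >&2
        return 1
    fi
}

# Paths assume the layout from the steps above; adjust as needed.
for link in "$HOME/deepspeech/tensorflow/native_client" \
            "$HOME/DeepSpeech" "$HOME/tensorflow"; do
    check_link "$link" || true
done
```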

ARMv8 patches and local changes

First, we have to patch native_client/kenlm/util/double-conversion/utils.h to allow aarch64 to round
properly. Failure to do this means that kenlm won’t build.

diff --git a/native_client/kenlm/util/double-conversion/utils.h b/native_client/kenlm/util/double-conversion/utils.h
index 9ccb3b6..492b8bd 100644
--- a/native_client/kenlm/util/double-conversion/utils.h
+++ b/native_client/kenlm/util/double-conversion/utils.h
@@ -52,7 +52,7 @@
 // the output of the division with the expected result. (Inlining must be
 // disabled.)
 // On Linux,x86 89255e-22 != Div_double(89255.0/1e22)
-#if defined(_M_X64) || defined(__x86_64__) || \
+#if defined(__aarch64__) || defined(_M_X64) || defined(__x86_64__) ||  \
     defined(__ARMEL__) || defined(__avr32__) || \
     defined(__hppa__) || defined(__ia64__) || \
     defined(__mips__) || defined(__powerpc__) || \
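
To double-check that the patch landed before kicking off a long build, a plain grep for the new guard is enough. A sketch only; `has_aarch64_guard` is a hypothetical helper, and the relative path assumes you run it from the DeepSpeech checkout:

```shell
#!/bin/sh
# has_aarch64_guard FILE: succeed if FILE mentions the __aarch64__ guard
# that the diff above adds.
has_aarch64_guard() {
    grep -qF 'defined(__aarch64__)' "$1"
}

# Path assumes the current directory is the DeepSpeech checkout.
f=native_client/kenlm/util/double-conversion/utils.h
if [ -f "$f" ]; then
    if has_aarch64_guard "$f"; then
        echo "patched: $f"
    else
        echo "NOT patched: $f"
    fi
else
    echo "utils.h not found from here; run this inside the DeepSpeech checkout"
fi
```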

Then, we can carefully construct a few shell scripts to build
tensorflow, then finally build native_client and wheels.


Using tensorflow/ as inspiration, we just want to pass the
right environment variables to bazel so we can run the whole thing as a


set -ex

export TF_ENABLE_XLA=0
export TF_NEED_GCP=0
export TF_NEED_HDFS=0
export TF_NEED_MKL=0
export TF_NEED_VERBS=0
export TF_NEED_MPI=0
export TF_NEED_S3=0
export TF_NEED_GDR=0
export GCC_HOST_COMPILER_PATH=/usr/bin/gcc
export TF_NEED_CUDA=1
export TX_CUDA_PATH='/usr/local/cuda'
export TX_CUDNN_PATH='/usr/lib/aarch64-linux-gnu/'

cd ${PROJECT_ROOT}/tensorflow && \
eval "export ${TF_CUDA_FLAGS}" && (echo "" | ./configure) && \
bazel build -s --explain bazel_kenlm_tf.log \
      --verbose_explanations \
      -c opt \
      --copt=-O3 \
      --config=cuda \
      // && \
bazel build -s --explain bazel_monolithic_tf.log \
      --verbose_explanations \
      --config=monolithic \
      -c opt \
      --copt=-O3 \
      --config=cuda \
      --copt=-fvisibility=hidden \
      // \
      //native_client:deepspeech_utils \
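
The script above leans on PROJECT_ROOT and TF_CUDA_FLAGS being defined somewhere else, and with `set -ex` an empty one fails in a fairly unhelpful way. One possible guard to put at the top, as a sketch; `require_vars` is a hypothetical helper, not something from the DeepSpeech or TensorFlow trees:

```shell
#!/bin/sh
# require_vars NAME...: fail with a message if any named variable is
# unset or empty.
require_vars() {
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing environment variable: $name" >&2
            return 1
        fi
    done
}

# The two names the build script above expects to be set beforehand.
require_vars PROJECT_ROOT TF_CUDA_FLAGS || echo "set these before running the build"
```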


This builds cleanly, so we can inspect the contents of the libraries
we’ve made.


    Are the symbols there? Looks like!

    ubuntu@nvidia:~/deepspeech/tensorflow/bazel-bin/native_client$ nm -gC | grep Model::Model
    00000000008390a0 T DeepSpeech::Model::Model(char const*, int, int, char const*, int)
    00000000008390a0 T DeepSpeech::Model::Model(char const*, int, int, char const*, int)

    Does it look like it linked sanely? Yes.

    ubuntu@nvidia:~/deepspeech/tensorflow$ ldd bazel-bin/native_client/
   =>  (0x0000007f871b8000)
   => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccusolver___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/ (0x0000007f7e63a000)
   => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccublas___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/ (0x0000007f7b93c000)
   => /usr/lib/ (0x0000007f7af61000)
   => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccudnn___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/ (0x0000007f7013a000)
   => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccufft___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/ (0x0000007f66a54000)
   => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccurand___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/ (0x0000007f633fb000)
   => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccudart___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/ (0x0000007f63397000)
   => /usr/lib/aarch64-linux-gnu/ (0x0000007f63369000)
   => /lib/aarch64-linux-gnu/ (0x0000007f63356000)
   => /lib/aarch64-linux-gnu/ (0x0000007f632a8000)
   => /lib/aarch64-linux-gnu/ (0x0000007f6327c000)
   => /usr/lib/aarch64-linux-gnu/ (0x0000007f630ed000)
   => /lib/aarch64-linux-gnu/ (0x0000007f630cb000)
   => /lib/aarch64-linux-gnu/ (0x0000007f62f84000)
            /lib/ (0x0000005557db8000)
   => /lib/aarch64-linux-gnu/ (0x0000007f62f6d000)
   => /usr/lib/ (0x0000007f62f36000)
   => /usr/lib/ (0x0000007f62efb000)
   => /usr/lib/ (0x0000007f62e92000)
   => /usr/lib/ (0x0000007f62e74000)
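
Eyeballing that much ldd output gets old quickly. A small filter that only prints unresolved entries, as a sketch; `missing_libs` is a hypothetical helper and `LIB_TO_CHECK` is a made-up variable you would point at the library you just built:

```shell
#!/bin/sh
# missing_libs: read `ldd` output on stdin and print unresolved entries.
# Exit status is 0 when everything resolved, 1 otherwise.
missing_libs() {
    if grep 'not found'; then
        return 1
    fi
    return 0
}

# LIB_TO_CHECK is a placeholder; set it to the shared object to inspect.
if [ -n "${LIB_TO_CHECK:-}" ]; then
    if ldd "$LIB_TO_CHECK" | missing_libs; then
        echo "all dependencies resolved"
    fi
fi
```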

native client

To build the native client, next…


EXTRA_LOCAL_LDFLAGS="-L/usr/local/cuda/targets/aarch64-linux/lib/ -L/usr/local/cuda/targets/aarch64-linux/lib/stubs -lcudart -lcuda"
SETUP_FLAGS="--project_name deepspeech-gpu"
cd ./DeepSpeech
mkdir -p wheels
make clean 
make -C native_client/ \
      TARGET=${SYSTEM_TARGET} \
      TFDIR=${DS_TFDIR} \
      bindings-clean bindings

cp native_client/dist/*.whl wheels

make -C native_client/ bindings-clean

and lo, we now have a wheel.
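
Installing it is then the `pip install <deepspeech.whl>` step mentioned earlier in the thread. A sketch that picks up the wheel from the wheels directory without hard-coding its file name; `find_wheel` is a hypothetical helper, and the pip line is left commented so nothing gets installed by accident:

```shell
#!/bin/sh
# find_wheel DIR: print the single .whl in DIR; fail if there is none
# or more than one, so we never install the wrong build.
find_wheel() {
    set -- "$1"/*.whl
    if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
        echo "expected exactly one wheel" >&2
        return 1
    fi
    echo "$1"
}

if [ -d wheels ]; then
    find_wheel wheels || true
    # pip install "$(find_wheel wheels)"
fi
```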


Does it work?

In [6]: ds = model.Model('output_graph.pb',26,9,'alphabet.txt',500)
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-02-23 14:06:05.179493: E tensorflow/stream_executor/cuda/] could not open file to read NUMA node: /sys/bus/pci/devices/0000:04:00.0/numa_node
Your kernel may have been built without NUMA support.
2018-02-23 14:06:05.180794: I tensorflow/core/common_runtime/gpu/] Found device 0 with properties: 
name: GP106 major: 6 minor: 1 memoryClockRate(GHz): 1.29
pciBusID: 0000:04:00.0
totalMemory: 3.75GiB freeMemory: 3.67GiB
2018-02-23 14:06:05.270756: E tensorflow/stream_executor/cuda/] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2018-02-23 14:06:05.270920: I tensorflow/core/common_runtime/gpu/] Found device 1 with properties: 
name: GP10B major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.50GiB freeMemory: 3.94GiB
2018-02-23 14:06:05.271027: I tensorflow/core/common_runtime/gpu/] Device peer to peer matrix
2018-02-23 14:06:05.271090: I tensorflow/core/common_runtime/gpu/] DMA: 0 1 
2018-02-23 14:06:05.271117: I tensorflow/core/common_runtime/gpu/] 0:   Y N 
2018-02-23 14:06:05.271157: I tensorflow/core/common_runtime/gpu/] 1:   N Y 
2018-02-23 14:06:05.271247: I tensorflow/core/common_runtime/gpu/] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GP106, pci bus id: 0000:04:00.0, compute capability: 6.1)
2018-02-23 14:06:05.271304: I tensorflow/core/common_runtime/gpu/] Ignoring gpu device (device: 1, name: GP10B, pci bus id: 0000:00:00.0, compute capability: 6.2) with Cuda multiprocessor count: 2. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
2018-02-23 14:08:22.530561: I tensorflow/core/common_runtime/gpu/] Could not identify NUMA node of /job:localhost/replica:0/task:0/device:GPU:0, defaulting to 0.  Your kernel may not have been built with NUMA support.

Looks hopeful…
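
Note that the log says the GP10B was ignored for having only 2 multiprocessors, and names the environment variable that controls the threshold. If you want TensorFlow to pick that device up, something like the following before starting Python should do it. This is a sketch based only on that log message; the value 2 is an assumption taken from the reported multiprocessor count:

```shell
#!/bin/sh
# Lower the minimum multiprocessor count so the 2-SM device is not skipped.
# Assumption: 2 matches the count the log reports for the GP10B.
export TF_MIN_GPU_MULTIPROCESSOR_COUNT=2
echo "TF_MIN_GPU_MULTIPROCESSOR_COUNT=$TF_MIN_GPU_MULTIPROCESSOR_COUNT"
```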

In [1]: from scipy.io import wavfile
ds = Model('output_graph.pb',26,9,'alphabet.txt',500)
fs,wav = wavfile.read('test.wav')
Out [1]: 'test'

and running tegrastats at the same time:

RAM 4642/6660MB (lfb 5x2MB) SWAP 811/8192MB (cached 67MB) cpu [10%@1991,0%@2034,0%@2035,7%@1992,5%@1995,8%@1993] EMC 0%@1600 GR3D 0%@1275 GR3D_PCI 98%@2607

So GPU is pegged and the CPU is nicely quiet. Finally!
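
If you want to watch just the GPU figure instead of the whole line, the load can be pulled out with sed. A sketch, assuming the field layout of the tegrastats line shown above; `gr3d_load` is a hypothetical helper name:

```shell
#!/bin/sh
# gr3d_load FIELD: read tegrastats lines on stdin and print the
# utilisation percentage that follows FIELD (e.g. GR3D or GR3D_PCI).
gr3d_load() {
    sed -n "s/.* $1 \([0-9][0-9]*\)%.*/\1/p"
}

# The sample line from the post above.
line='RAM 4642/6660MB (lfb 5x2MB) SWAP 811/8192MB (cached 67MB) cpu [10%@1991,0%@2034,0%@2035,7%@1992,5%@1995,8%@1993] EMC 0%@1600 GR3D 0%@1275 GR3D_PCI 98%@2607'
printf '%s\n' "$line" | gr3d_load GR3D_PCI   # prints 98
```

On a live board you would pipe tegrastats itself into the function rather than a saved line.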

(Sai Kishor Kothakota) #40

@gvoysey If I pass --config=cuda as a parameter, I get the error below; without it, the build works fine. Can you help me with this?

nvidia@tegra-ubuntu:~/deepspeech/tensorflow$ bazel build -c opt --copt=-O3 // --config=cuda
ERROR: /home/nvidia/.cache/bazel/_bazel_nvidia/6b9138338a6a5d153417b602388184c1/external/local_config_cuda/crosstool/BUILD:4:1: Traceback (most recent call last):
	File "/home/nvidia/.cache/bazel/_bazel_nvidia/6b9138338a6a5d153417b602388184c1/external/local_config_cuda/crosstool/BUILD", line 4
	File "/home/nvidia/.cache/bazel/_bazel_nvidia/6b9138338a6a5d153417b602388184c1/external/local_config_cuda/crosstool/error_gpu_disabled.bzl", line 3, in error_gpu_disabled
		fail("ERROR: Building with --config=c...")
ERROR: Building with --config=cuda but TensorFlow is not configured to build with GPU support. Please re-run ./configure and enter 'Y' at the prompt to build with GPU support.
ERROR: no such target '@local_config_cuda//crosstool:toolchain': target 'toolchain' not declared in package 'crosstool' defined by /home/nvidia/.cache/bazel/_bazel_nvidia/6b9138338a6a5d153417b602388184c1/external/local_config_cuda/crosstool/BUILD.
INFO: Elapsed time: 3.482s

At first I had the above error, which I solved by running ./configure and setting CUDA and cuDNN. After that, I ran into the following issue.
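
For anyone hitting the same first error: one way to check whether ./configure really recorded CUDA support before retrying bazel is to look at what it wrote out. This is a sketch under an assumption: that this TensorFlow branch stores its answers in `.tf_configure.bazelrc` in the checkout root with a `TF_NEED_CUDA="1"` entry. `cuda_configured` is a hypothetical helper:

```shell
#!/bin/sh
# cuda_configured FILE: succeed if FILE records TF_NEED_CUDA as 1.
# Assumption: configure writes its answers to a bazelrc-style file.
cuda_configured() {
    grep -Eq 'TF_NEED_CUDA="?1' "$1"
}

# Assumed file name; run from the tensorflow checkout.
rc=.tf_configure.bazelrc
if [ -f "$rc" ]; then
    if cuda_configured "$rc"; then
        echo "CUDA enabled in $rc"
    else
        echo "CUDA not enabled in $rc; re-run ./configure"
    fi
else
    echo "$rc not found; has ./configure been run here?"
fi
```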

Configuration I have set:
nvidia@tegra-ubuntu:~/deepspeech/tensorflow$ ./configure
You have bazel 0.5.4- (@non-git) installed.
Please specify the location of python. [Default is /usr/bin/python]:

Found possible Python library paths:
Please input the desired Python library path to use.  Default is [/usr/local/lib/python2.7/dist-packages]

Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: y
jemalloc as malloc support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: y
Google Cloud Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: y
Hadoop File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: y
Amazon S3 File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [y/N]: y
XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with GDR support? [y/N]: y
GDR support will be enabled for TensorFlow.

Do you wish to build TensorFlow with VERBS support? [y/N]: y
VERBS support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: 8.0

Please specify the location where CUDA 8.0 toolkit is installed. Refer to for more details. [Default is /usr/local/cuda]: 

Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 6.0.21

Please specify the location where cuDNN 6.0.21 library is installed. Refer to for more details. [Default is /usr/local/cuda]:

Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at:
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.5,5.2]3.5,5.2,6.2

Do you want to use clang as CUDA compiler? [y/N]: y
Clang will be used as CUDA compiler.

Please specify which clang should be used as device and host compiler. [Default is ]: /usr/bin/g++

Do you wish to build TensorFlow with MPI support? [y/N]: y
MPI support will be enabled for TensorFlow.

Please specify the MPI toolkit folder. [Default is /usr]: 

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: 

Add "--config=mkl" to your bazel command to build with MKL support.
Please note that MKL on MacOS or windows is still not supported.
If you would like to use a local MKL instead of downloading, please set the environment variable "TF_MKL_ROOT" every time before build.

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: n
Not configuring the WORKSPACE for Android builds.

Configuration finished

This is the output I got after setting the configuration:

nvidia@tegra-ubuntu:~/deepspeech/tensorflow$ bazel build -c opt --copt=-O3 // --config=cuda
ERROR: /home/nvidia/.cache/bazel/_bazel_nvidia/6b9138338a6a5d153417b602388184c1/external/org_tensorflow/tensorflow/BUILD:703:1: Illegal ambiguous match on configurable attribute "deps" in @org_tensorflow//
Multiple matches are not allowed unless one is unambiguously more specialized.
ERROR: Analysis of target '//' failed; build aborted.
INFO: Elapsed time: 0.220s

(Lissyx) #41

Please retry without enabling everything: you don’t need S3, Google Cloud, HADOOP, GDR, VERBS, MPI, and you don’t want to use Clang for CUDA.
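
One way to avoid answering the prompts by hand (and enabling things by accident) is to pre-seed the answers through the environment, the same way the build script earlier in this thread does. A sketch: the TF_NEED_* names are the ones from that script, while TF_CUDA_CLANG is an assumption about the variable configure reads for the clang question:

```shell
#!/bin/sh
# Pre-seed ./configure's answers instead of typing them at the prompts.
# Everything named in the advice above is switched off; CUDA stays on.
export TF_NEED_S3=0
export TF_NEED_GCP=0
export TF_NEED_HDFS=0
export TF_NEED_GDR=0
export TF_NEED_VERBS=0
export TF_NEED_MPI=0
export TF_NEED_CUDA=1
export TF_CUDA_CLANG=0   # assumption: keeps gcc/nvcc rather than clang for CUDA

# Then, from the tensorflow checkout (left commented so this block only
# sets variables):
# (echo "" | ./configure)
```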