ARM native_client with GPU support

Notes and steps for compiling natively on a Jetson machine. Goal: get
CUDA-enabled native_client==v0.1.1.

Repo setup

We need to compile Mozilla's TensorFlow r1.5 fork as well as the native_client
package that ships as part of Mozilla DeepSpeech.

cd $HOME/deepspeech  #project root
git clone https://github.com/mozilla/DeepSpeech  #default branch (master)
git clone https://github.com/mozilla/tensorflow
#master breaks bazel.
cd tensorflow && git checkout r1.5
#put a symlink to native client
cd ../DeepSpeech
ln -s "$(pwd)/native_client" ../tensorflow/
cd $HOME
ln -s deepspeech/DeepSpeech ./
ln -s deepspeech/tensorflow ./

ARMv8 patches and local changes

First, we have to patch native_client/kenlm/util/double-conversion/utils.h so that
aarch64 gets correct rounding. Without this, kenlm won’t build.

diff --git a/native_client/kenlm/util/double-conversion/utils.h b/native_client/kenlm/util/double-conversion/utils.h
index 9ccb3b6..492b8bd 100644
--- a/native_client/kenlm/util/double-conversion/utils.h
+++ b/native_client/kenlm/util/double-conversion/utils.h
@@ -52,7 +52,7 @@
 // the output of the division with the expected result. (Inlining must be
 // disabled.)
 // On Linux,x86 89255e-22 != Div_double(89255.0/1e22)
-#if defined(_M_X64) || defined(__x86_64__) || \
+#if defined(__aarch64__) || defined(_M_X64) || defined(__x86_64__) ||  \
     defined(__ARMEL__) || defined(__avr32__) || \
     defined(__hppa__) || defined(__ia64__) || \
     defined(__mips__) || defined(__powerpc__) || \
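
If you save the diff above to a file (the name aarch64-utils.patch below is only an example), it can be applied from the DeepSpeech checkout with git apply; editing the header by hand works just as well.

cd $HOME/deepspeech/DeepSpeech
git apply aarch64-utils.patch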

With the patch in place, we can carefully construct a couple of shell scripts: one
to build tensorflow, and a second to build native_client and the wheels.

tensorflow

Using tensorflow/tc-build.sh as inspiration, we just want to pass the
right environment variables to bazel so we can run the whole thing as a
one-shot.

#!/bin/bash

set -ex
PROJECT_ROOT=$HOME/deepspeech
export LD_LIBRARY_PATH=/usr/local/cuda/targets/aarch64-linux/lib/:/usr/local/cuda/targets/aarch64-linux/lib/stubs:$LD_LIBRARY_PATH


export TF_ENABLE_XLA=0
export TF_NEED_JEMALLOC=1
export TF_NEED_GCP=0
export TF_NEED_HDFS=0
export TF_NEED_OPENCL_SYCL=0
export TF_NEED_MKL=0
export TF_NEED_VERBS=0
export TF_NEED_MPI=0
export TF_NEED_S3=0
export TF_NEED_GDR=0
export TF_SET_ANDROID_WORKSPACE=0
export GCC_HOST_COMPILER_PATH=/usr/bin/gcc
export TF_NEED_CUDA=1
export TX_CUDA_PATH='/usr/local/cuda'
export TX_CUDNN_PATH='/usr/lib/aarch64-linux-gnu/'
export TF_CUDA_FLAGS="TF_CUDA_CLANG=0 TF_CUDA_VERSION=8.0 TF_CUDNN_VERSION=6 CUDA_TOOLKIT_PATH=${TX_CUDA_PATH} CUDNN_INSTALL_PATH=${TX_CUDNN_PATH} TF_CUDA_COMPUTE_CAPABILITIES=\"3.0,3.5,3.7,5.2,5.3,6.0,6.1\""

cd ${PROJECT_ROOT}/tensorflow && \
eval "export ${TF_CUDA_FLAGS}" && (echo "" | ./configure) && \
bazel build -s --explain bazel_kenlm_tf.log \
      --verbose_explanations \
      -c opt \
      --copt=-O3 \
      --config=cuda \
      //native_client:libctc_decoder_with_kenlm.so && \
bazel build -s --explain bazel_monolithic_tf.log \
      --verbose_explanations \
      --config=monolithic \
      -c opt \
      --copt=-O3 \
      --config=cuda \
      --copt=-fvisibility=hidden \
      //native_client:libdeepspeech.so \
      //native_client:deepspeech_utils \
      //native_client:generate_trie

Completion

This builds cleanly, so we can inspect the contents of the libraries
we’ve made.

  1. libdeepspeech.so

    Are the symbols there? Looks like!

    ubuntu@nvidia:~/deepspeech/tensorflow/bazel-bin/native_client$ nm -gC libdeepspeech.so | grep Model::Model
    00000000008390a0 T DeepSpeech::Model::Model(char const*, int, int, char const*, int)
    00000000008390a0 T DeepSpeech::Model::Model(char const*, int, int, char const*, int)
    

    Does it look like it linked sanely? Yes.

    ubuntu@nvidia:~/deepspeech/tensorflow$ ldd bazel-bin/native_client/libdeepspeech.so
            linux-vdso.so.1 =>  (0x0000007f871b8000)
            libcusolver.so.8.0 => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccusolver___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcusolver.so.8.0 (0x0000007f7e63a000)
            libcublas.so.8.0 => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccublas___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcublas.so.8.0 (0x0000007f7b93c000)
            libcuda.so.1 => /usr/lib/libcuda.so.1 (0x0000007f7af61000)
            libcudnn.so.6 => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccudnn___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcudnn.so.6 (0x0000007f7013a000)
            libcufft.so.8.0 => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccufft___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcufft.so.8.0 (0x0000007f66a54000)
            libcurand.so.8.0 => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccurand___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcurand.so.8.0 (0x0000007f633fb000)
            libcudart.so.8.0 => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccudart___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcudart.so.8.0 (0x0000007f63397000)
            libgomp.so.1 => /usr/lib/aarch64-linux-gnu/libgomp.so.1 (0x0000007f63369000)
            libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000007f63356000)
            libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000007f632a8000)
            libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000007f6327c000)
            libstdc++.so.6 => /usr/lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000007f630ed000)
            libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000007f630cb000)
            libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000007f62f84000)
            /lib/ld-linux-aarch64.so.1 (0x0000005557db8000)
            librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000007f62f6d000)
            libnvrm_gpu.so => /usr/lib/libnvrm_gpu.so (0x0000007f62f36000)
            libnvrm.so => /usr/lib/libnvrm.so (0x0000007f62efb000)
            libnvidia-fatbinaryloader.so.384.00 => /usr/lib/libnvidia-fatbinaryloader.so.384.00 (0x0000007f62e92000)
            libnvos.so => /usr/lib/libnvos.so (0x0000007f62e74000)
    

native client

Next, to build the native client and package the wheel…

#!/bin/bash

SYSTEM_TARGET=host
EXTRA_LOCAL_CFLAGS="-march=armv8-a"
EXTRA_LOCAL_LDFLAGS="-L/usr/local/cuda/targets/aarch64-linux/lib/ -L/usr/local/cuda/targets/aarch64-linux/lib/stubs -lcudart -lcuda"
SETUP_FLAGS="--project_name deepspeech-gpu"
DS_TFDIR="${HOME}/deepspeech/tensorflow"
cd ./DeepSpeech
mkdir -p wheels
make clean 
EXTRA_CFLAGS="${EXTRA_LOCAL_CFLAGS}" \
EXTRA_LDFLAGS="${EXTRA_LOCAL_LDFLAGS}" \
EXTRA_LIBS="${EXTRA_LOCAL_LIBS}" \
make -C native_client/ TARGET=${SYSTEM_TARGET} \
      TFDIR=${DS_TFDIR} \
      SETUP_FLAGS="${SETUP_FLAGS}" \
      bindings-clean bindings

cp native_client/dist/*.whl wheels

make -C native_client/ bindings-clean

and lo, we now have a wheel.
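
Installing it is then just a pip install of the file we copied into wheels/; the exact file name below is an assumption, since it depends on the package version and platform tags.

pip install wheels/deepspeech_gpu-*.whl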

test

Does it work?

In [6]: ds = model.Model('output_graph.pb',26,9,'alphabet.txt',500)
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-02-23 14:06:05.179493: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:881] could not open file to read NUMA node: /sys/bus/pci/devices/0000:04:00.0/numa_node
Your kernel may have been built without NUMA support.
2018-02-23 14:06:05.180794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: GP106 major: 6 minor: 1 memoryClockRate(GHz): 1.29
pciBusID: 0000:04:00.0
totalMemory: 3.75GiB freeMemory: 3.67GiB
2018-02-23 14:06:05.270756: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:881] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2018-02-23 14:06:05.270920: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 1 with properties: 
name: GP10B major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.50GiB freeMemory: 3.94GiB
2018-02-23 14:06:05.271027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Device peer to peer matrix
2018-02-23 14:06:05.271090: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1126] DMA: 0 1 
2018-02-23 14:06:05.271117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 0:   Y N 
2018-02-23 14:06:05.271157: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 1:   N Y 
2018-02-23 14:06:05.271247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GP106, pci bus id: 0000:04:00.0, compute capability: 6.1)
2018-02-23 14:06:05.271304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1182] Ignoring gpu device (device: 1, name: GP10B, pci bus id: 0000:00:00.0, compute capability: 6.2) with Cuda multiprocessor count: 2. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
2018-02-23 14:08:22.530561: I tensorflow/core/common_runtime/gpu/gpu_device.cc:859] Could not identify NUMA node of /job:localhost/replica:0/task:0/device:GPU:0, defaulting to 0.  Your kernel may not have been built with NUMA support.

Looks hopeful…

In [1]: from scipy.io import wavfile
ds = Model('output_graph.pb',26,9,'alphabet.txt',500)
fs,wav = wavfile.read('test.wav')
ds.stt(wav,fs)
Out [1]: 'test'

and running tegrastats at the same time:

RAM 4642/6660MB (lfb 5x2MB) SWAP 811/8192MB (cached 67MB) cpu [10%@1991,0%@2034,0%@2035,7%@1992,5%@1995,8%@1993] EMC 0%@1600 GR3D 0%@1275 GR3D_PCI 98%@2607

So the PCIe GPU (GR3D_PCI) is pegged at 98% and the CPUs are nicely quiet. Finally!
