ARM native_client with GPU support

Notes and steps for compiling natively on a Jetson machine. Goal: get
CUDA-enabled native_client==v0.1.1.

Repo setup

We need to compile Mozilla's TensorFlow r1.5 fork as well as the native_client
package that ships as part of Mozilla DeepSpeech.

cd $HOME/deepspeech  #project root
git clone https://github.com/mozilla/DeepSpeech  #default branch (master)
git clone https://github.com/mozilla/tensorflow
#master breaks bazel.
cd tensorflow && git checkout r1.5
#put a symlink to native client
cd ../DeepSpeech
ln -s "$(pwd)/native_client" ../tensorflow/
cd $HOME
ln -s deepspeech/DeepSpeech ./
ln -s deepspeech/tensorflow ./

ARMv8 patches and local changes

First, we have to patch native_client/kenlm/util/double-conversion/utils.h so that
aarch64 gets correct rounding. Without this, kenlm won’t build.

diff --git a/native_client/kenlm/util/double-conversion/utils.h b/native_client/kenlm/util/double-conversion/utils.h
index 9ccb3b6..492b8bd 100644
--- a/native_client/kenlm/util/double-conversion/utils.h
+++ b/native_client/kenlm/util/double-conversion/utils.h
@@ -52,7 +52,7 @@
 // the output of the division with the expected result. (Inlining must be
 // disabled.)
 // On Linux,x86 89255e-22 != Div_double(89255.0/1e22)
-#if defined(_M_X64) || defined(__x86_64__) || \
+#if defined(__aarch64__) || defined(_M_X64) || defined(__x86_64__) ||  \
     defined(__ARMEL__) || defined(__avr32__) || \
     defined(__hppa__) || defined(__ia64__) || \
     defined(__mips__) || defined(__powerpc__) || \
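
If you save the diff above to a file (the name aarch64-utils.patch below is only an example), it can be applied from the DeepSpeech checkout with git apply; editing the header by hand works just as well.

cd $HOME/deepspeech/DeepSpeech
git apply aarch64-utils.patch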

With the patch in place, we can carefully construct a couple of shell scripts: one
to build tensorflow, and a second to build native_client and the wheels.

tensorflow

Using tensorflow/tc-build.sh as inspiration, we just want to pass the
right environment variables to bazel so we can run the whole thing as a
one-shot.

#!/bin/bash

set -ex
PROJECT_ROOT=$HOME/deepspeech
export LD_LIBRARY_PATH=/usr/local/cuda/targets/aarch64-linux/lib/:/usr/local/cuda/targets/aarch64-linux/lib/stubs:$LD_LIBRARY_PATH


export TF_ENABLE_XLA=0
export TF_NEED_JEMALLOC=1
export TF_NEED_GCP=0
export TF_NEED_HDFS=0
export TF_NEED_OPENCL_SYCL=0
export TF_NEED_MKL=0
export TF_NEED_VERBS=0
export TF_NEED_MPI=0
export TF_NEED_S3=0
export TF_NEED_GDR=0
export TF_SET_ANDROID_WORKSPACE=0
export GCC_HOST_COMPILER_PATH=/usr/bin/gcc
export TF_NEED_CUDA=1
export TX_CUDA_PATH='/usr/local/cuda'
export TX_CUDNN_PATH='/usr/lib/aarch64-linux-gnu/'
export TF_CUDA_FLAGS="TF_CUDA_CLANG=0 TF_CUDA_VERSION=8.0 TF_CUDNN_VERSION=6 CUDA_TOOLKIT_PATH=${TX_CUDA_PATH} CUDNN_INSTALL_PATH=${TX_CUDNN_PATH} TF_CUDA_COMPUTE_CAPABILITIES=\"3.0,3.5,3.7,5.2,5.3,6.0,6.1\""

cd ${PROJECT_ROOT}/tensorflow && \
eval "export ${TF_CUDA_FLAGS}" && (echo "" | ./configure) && \
bazel build -s --explain bazel_kenlm_tf.log \
      --verbose_explanations \
      -c opt \
      --copt=-O3 \
      --config=cuda \
      //native_client:libctc_decoder_with_kenlm.so && \
bazel build -s --explain bazel_monolithic_tf.log \
      --verbose_explanations \
      --config=monolithic \
      -c opt \
      --copt=-O3 \
      --config=cuda \
      --copt=-fvisibility=hidden \
      //native_client:libdeepspeech.so \
      //native_client:deepspeech_utils \
      //native_client:generate_trie

Completion

This builds cleanly, so we can inspect the contents of the libraries
we’ve made.

  1. libdeepspeech.so

    Are the symbols there? Looks like!

    ubuntu@nvidia:~/deepspeech/tensorflow/bazel-bin/native_client$ nm -gC libdeepspeech.so | grep Model::Model
    00000000008390a0 T DeepSpeech::Model::Model(char const*, int, int, char const*, int)
    00000000008390a0 T DeepSpeech::Model::Model(char const*, int, int, char const*, int)
    

    Does it look like it linked sanely? Yes.

    ubuntu@nvidia:~/deepspeech/tensorflow$ ldd bazel-bin/native_client/libdeepspeech.so
            linux-vdso.so.1 =>  (0x0000007f871b8000)
            libcusolver.so.8.0 => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccusolver___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcusolver.so.8.0 (0x0000007f7e63a000)
            libcublas.so.8.0 => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccublas___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcublas.so.8.0 (0x0000007f7b93c000)
            libcuda.so.1 => /usr/lib/libcuda.so.1 (0x0000007f7af61000)
            libcudnn.so.6 => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccudnn___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcudnn.so.6 (0x0000007f7013a000)
            libcufft.so.8.0 => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccufft___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcufft.so.8.0 (0x0000007f66a54000)
            libcurand.so.8.0 => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccurand___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcurand.so.8.0 (0x0000007f633fb000)
            libcudart.so.8.0 => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccudart___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcudart.so.8.0 (0x0000007f63397000)
            libgomp.so.1 => /usr/lib/aarch64-linux-gnu/libgomp.so.1 (0x0000007f63369000)
            libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000007f63356000)
            libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000007f632a8000)
            libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000007f6327c000)
            libstdc++.so.6 => /usr/lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000007f630ed000)
            libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000007f630cb000)
            libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000007f62f84000)
            /lib/ld-linux-aarch64.so.1 (0x0000005557db8000)
            librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000007f62f6d000)
            libnvrm_gpu.so => /usr/lib/libnvrm_gpu.so (0x0000007f62f36000)
            libnvrm.so => /usr/lib/libnvrm.so (0x0000007f62efb000)
            libnvidia-fatbinaryloader.so.384.00 => /usr/lib/libnvidia-fatbinaryloader.so.384.00 (0x0000007f62e92000)
            libnvos.so => /usr/lib/libnvos.so (0x0000007f62e74000)
    

native client

Next, to build the native client and package the wheel…

#!/bin/bash

SYSTEM_TARGET=host
EXTRA_LOCAL_CFLAGS="-march=armv8-a"
EXTRA_LOCAL_LDFLAGS="-L/usr/local/cuda/targets/aarch64-linux/lib/ -L/usr/local/cuda/targets/aarch64-linux/lib/stubs -lcudart -lcuda"
SETUP_FLAGS="--project_name deepspeech-gpu"
DS_TFDIR="${HOME}/deepspeech/tensorflow"
cd ./DeepSpeech
mkdir -p wheels
make clean 
EXTRA_CFLAGS="${EXTRA_LOCAL_CFLAGS}" \
EXTRA_LDFLAGS="${EXTRA_LOCAL_LDFLAGS}" \
EXTRA_LIBS="${EXTRA_LOCAL_LIBS}" \
make -C native_client/ TARGET=${SYSTEM_TARGET} \
      TFDIR=${DS_TFDIR} \
      SETUP_FLAGS="${SETUP_FLAGS}" \
      bindings-clean bindings

cp native_client/dist/*.whl wheels

make -C native_client/ bindings-clean

and lo, we now have a wheel.
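
Installing it is then just a pip install of the file we copied into wheels/; the exact file name below is an assumption, since it depends on the package version and platform tags.

pip install wheels/deepspeech_gpu-*.whl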

test

Does it work?

In [6]: ds = model.Model('output_graph.pb',26,9,'alphabet.txt',500)
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-02-23 14:06:05.179493: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:881] could not open file to read NUMA node: /sys/bus/pci/devices/0000:04:00.0/numa_node
Your kernel may have been built without NUMA support.
2018-02-23 14:06:05.180794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: GP106 major: 6 minor: 1 memoryClockRate(GHz): 1.29
pciBusID: 0000:04:00.0
totalMemory: 3.75GiB freeMemory: 3.67GiB
2018-02-23 14:06:05.270756: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:881] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2018-02-23 14:06:05.270920: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 1 with properties: 
name: GP10B major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.50GiB freeMemory: 3.94GiB
2018-02-23 14:06:05.271027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Device peer to peer matrix
2018-02-23 14:06:05.271090: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1126] DMA: 0 1 
2018-02-23 14:06:05.271117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 0:   Y N 
2018-02-23 14:06:05.271157: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 1:   N Y 
2018-02-23 14:06:05.271247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GP106, pci bus id: 0000:04:00.0, compute capability: 6.1)
2018-02-23 14:06:05.271304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1182] Ignoring gpu device (device: 1, name: GP10B, pci bus id: 0000:00:00.0, compute capability: 6.2) with Cuda multiprocessor count: 2. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
2018-02-23 14:08:22.530561: I tensorflow/core/common_runtime/gpu/gpu_device.cc:859] Could not identify NUMA node of /job:localhost/replica:0/task:0/device:GPU:0, defaulting to 0.  Your kernel may not have been built with NUMA support.

Looks hopeful…

In [1]: from scipy.io import wavfile
ds = Model('output_graph.pb',26,9,'alphabet.txt',500)
fs,wav = wavfile.read('test.wav')
ds.stt(wav,fs)
Out [1]: 'test'

and running tegrastats at the same time:

RAM 4642/6660MB (lfb 5x2MB) SWAP 811/8192MB (cached 67MB) cpu [10%@1991,0%@2034,0%@2035,7%@1992,5%@1995,8%@1993] EMC 0%@1600 GR3D 0%@1275 GR3D_PCI 98%@2607

So the PCIe GPU (GR3D_PCI) is pegged at 98% and the CPUs are nicely quiet. Finally!
