Notes and steps for compiling natively on a Jetson machine. Goal: get
CUDA-enabled native_client==v0.1.1.
Repo setup
We need to compile Mozilla’s TensorFlow r1.5 fork as well as the native_client
package that ships as part of Mozilla’s DeepSpeech.
cd $HOME/deepspeech #project root
git clone https://github.com/mozilla/DeepSpeech
git clone https://github.com/mozilla/tensorflow
# master breaks bazel, so use the r1.5 branch
cd tensorflow && git checkout r1.5
# symlink native_client into the tensorflow tree so bazel can see it
cd ../DeepSpeech
ln -s ../DeepSpeech/native_client ../tensorflow/
cd $HOME
ln -s deepspeech/DeepSpeech ./
ln -s deepspeech/tensorflow ./
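Before going further, it is worth sanity-checking that the symlink resolves to the right place (paths assume the layout above):

ls -l $HOME/deepspeech/tensorflow/native_client
readlink -f $HOME/deepspeech/tensorflow/native_client  # should print .../deepspeech/DeepSpeech/native_client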
ARMv8 patches and local changes
First, we have to patch native_client/kenlm/util/double-conversion/utils.h so that
aarch64 takes the correct floating-point rounding path. Without this, kenlm won’t build.
diff --git a/native_client/kenlm/util/double-conversion/utils.h b/native_client/kenlm/util/double-conversion/utils.h
index 9ccb3b6..492b8bd 100644
--- a/native_client/kenlm/util/double-conversion/utils.h
+++ b/native_client/kenlm/util/double-conversion/utils.h
@@ -52,7 +52,7 @@
// the output of the division with the expected result. (Inlining must be
// disabled.)
// On Linux,x86 89255e-22 != Div_double(89255.0/1e22)
-#if defined(_M_X64) || defined(__x86_64__) || \
+#if defined(__aarch64__) || defined(_M_X64) || defined(__x86_64__) || \
defined(__ARMEL__) || defined(__avr32__) || \
defined(__hppa__) || defined(__ia64__) || \
defined(__mips__) || defined(__powerpc__) || \
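If the diff above is saved to a file (the filename below is just an example), it can be applied from the DeepSpeech checkout with git apply:

cd $HOME/deepspeech/DeepSpeech
# aarch64-utils.patch is whatever name the diff above was saved under
git apply aarch64-utils.patch
git diff --stat   # expect: utils.h, 1 insertion(+), 1 deletion(-)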
Then we can put together a couple of shell scripts: one to build tensorflow, and
one to build native_client and the wheels.
tensorflow
Using tensorflow/tc-build.sh as inspiration, we just need to pass the right
environment variables to configure and bazel so that the whole build runs as a
one-shot.
#!/bin/bash
set -ex
PROJECT_ROOT=$HOME/deepspeech
export LD_LIBRARY_PATH=/usr/local/cuda/targets/aarch64-linux/lib/:/usr/local/cuda/targets/aarch64-linux/lib/stubs:$LD_LIBRARY_PATH
export TF_ENABLE_XLA=0
export TF_NEED_JEMALLOC=1
export TF_NEED_GCP=0
export TF_NEED_HDFS=0
export TF_NEED_OPENCL_SYCL=0
export TF_NEED_MKL=0
export TF_NEED_VERBS=0
export TF_NEED_MPI=0
export TF_NEED_S3=0
export TF_NEED_GDR=0
export TF_SET_ANDROID_WORKSPACE=0
export GCC_HOST_COMPILER_PATH=/usr/bin/gcc
export TF_NEED_CUDA=1
export TX_CUDA_PATH='/usr/local/cuda'
export TX_CUDNN_PATH='/usr/lib/aarch64-linux-gnu/'
export TF_CUDA_FLAGS="TF_CUDA_CLANG=0 TF_CUDA_VERSION=8.0 TF_CUDNN_VERSION=6 CUDA_TOOLKIT_PATH=${TX_CUDA_PATH} CUDNN_INSTALL_PATH=${TX_CUDNN_PATH} TF_CUDA_COMPUTE_CAPABILITIES=\"3.0,3.5,3.7,5.2,5.3,6.0,6.1\""
cd ${PROJECT_ROOT}/tensorflow && \
eval "export ${TF_CUDA_FLAGS}" && (echo "" | ./configure) && \
bazel build -s --explain bazel_kenlm_tf.log \
--verbose_explanations \
-c opt \
--copt=-O3 \
--config=cuda \
//native_client:libctc_decoder_with_kenlm.so && \
bazel build -s --explain bazel_monolithic_tf.log \
--verbose_explanations \
--config=monolithic \
-c opt \
--copt=-O3 \
--config=cuda \
--copt=-fvisibility=hidden \
//native_client:libdeepspeech.so \
//native_client:deepspeech_utils \
//native_client:generate_trie
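Saved as, say, build-tensorflow.sh (the name is arbitrary), the script runs unattended; keeping the log around is handy since the bazel steps take a while on the Jetson:

chmod +x build-tensorflow.sh
./build-tensorflow.sh 2>&1 | tee tf-build.log
# the libraries land in the bazel output tree
ls -l $HOME/deepspeech/tensorflow/bazel-bin/native_client/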
Completion
This builds cleanly, so we can inspect the contents of the libraries
we’ve made.
libdeepspeech.so
Are the symbols there? Looks like!
ubuntu@nvidia:~/deepspeech/tensorflow/bazel-bin/native_client$ nm -gC libdeepspeech.so | grep Model::Model
00000000008390a0 T DeepSpeech::Model::Model(char const*, int, int, char const*, int)
00000000008390a0 T DeepSpeech::Model::Model(char const*, int, int, char const*, int)
Does it look like it linked sanely? Yes.
ubuntu@nvidia:~/deepspeech/tensorflow$ ldd bazel-bin/native_client/libdeepspeech.so
    linux-vdso.so.1 => (0x0000007f871b8000)
    libcusolver.so.8.0 => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccusolver___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcusolver.so.8.0 (0x0000007f7e63a000)
    libcublas.so.8.0 => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccublas___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcublas.so.8.0 (0x0000007f7b93c000)
    libcuda.so.1 => /usr/lib/libcuda.so.1 (0x0000007f7af61000)
    libcudnn.so.6 => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccudnn___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcudnn.so.6 (0x0000007f7013a000)
    libcufft.so.8.0 => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccufft___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcufft.so.8.0 (0x0000007f66a54000)
    libcurand.so.8.0 => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccurand___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcurand.so.8.0 (0x0000007f633fb000)
    libcudart.so.8.0 => /home/ubuntu/deepspeech/tensorflow/bazel-bin/native_client/../_solib_local/_U@local_Uconfig_Ucuda_S_Scuda_Ccudart___Uexternal_Slocal_Uconfig_Ucuda_Scuda_Scuda_Slib/libcudart.so.8.0 (0x0000007f63397000)
    libgomp.so.1 => /usr/lib/aarch64-linux-gnu/libgomp.so.1 (0x0000007f63369000)
    libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000007f63356000)
    libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000007f632a8000)
    libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000007f6327c000)
    libstdc++.so.6 => /usr/lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000007f630ed000)
    libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000007f630cb000)
    libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000007f62f84000)
    /lib/ld-linux-aarch64.so.1 (0x0000005557db8000)
    librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000007f62f6d000)
    libnvrm_gpu.so => /usr/lib/libnvrm_gpu.so (0x0000007f62f36000)
    libnvrm.so => /usr/lib/libnvrm.so (0x0000007f62efb000)
    libnvidia-fatbinaryloader.so.384.00 => /usr/lib/libnvidia-fatbinaryloader.so.384.00 (0x0000007f62e92000)
    libnvos.so => /usr/lib/libnvos.so (0x0000007f62e74000)
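A quick extra check, run from the same directory, to catch anything that failed to resolve:

ldd bazel-bin/native_client/libdeepspeech.so | grep "not found" || echo "all dependencies resolved"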
native client
Next, we build the native client itself and package the Python bindings into a wheel.
#!/bin/bash
SYSTEM_TARGET=host
EXTRA_LOCAL_CFLAGS="-march=armv8-a"
EXTRA_LOCAL_LDFLAGS="-L/usr/local/cuda/targets/aarch64-linux/lib/ -L/usr/local/cuda/targets/aarch64-linux/lib/stubs -lcudart -lcuda"
SETUP_FLAGS="--project_name deepspeech-gpu"
DS_TFDIR="${HOME}/deepspeech/tensorflow"
cd ${HOME}/DeepSpeech
mkdir -p wheels
make clean
EXTRA_CFLAGS="${EXTRA_LOCAL_CFLAGS}" \
EXTRA_LDFLAGS="${EXTRA_LOCAL_LDFLAGS}" \
EXTRA_LIBS="${EXTRA_LOCAL_LIBS}" \
make -C native_client/ TARGET=${SYSTEM_TARGET} \
TFDIR=${DS_TFDIR} \
SETUP_FLAGS="${SETUP_FLAGS}" \
bindings-clean bindings
cp native_client/dist/*.whl wheels
make -C native_client/ bindings-clean
and lo, we now have a wheel.
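Installing it is a normal pip install; the exact filename depends on the Python tag, so the glob below is illustrative:

pip install ${HOME}/DeepSpeech/wheels/deepspeech_gpu-*.whl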
test
Does it work?
In [6]: ds = model.Model('output_graph.pb',26,9,'alphabet.txt',500)
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-02-23 14:06:05.179493: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:881] could not open file to read NUMA node: /sys/bus/pci/devices/0000:04:00.0/numa_node
Your kernel may have been built without NUMA support.
2018-02-23 14:06:05.180794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: GP106 major: 6 minor: 1 memoryClockRate(GHz): 1.29
pciBusID: 0000:04:00.0
totalMemory: 3.75GiB freeMemory: 3.67GiB
2018-02-23 14:06:05.270756: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:881] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2018-02-23 14:06:05.270920: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 1 with properties:
name: GP10B major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.50GiB freeMemory: 3.94GiB
2018-02-23 14:06:05.271027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Device peer to peer matrix
2018-02-23 14:06:05.271090: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1126] DMA: 0 1
2018-02-23 14:06:05.271117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 0: Y N
2018-02-23 14:06:05.271157: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 1: N Y
2018-02-23 14:06:05.271247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GP106, pci bus id: 0000:04:00.0, compute capability: 6.1)
2018-02-23 14:06:05.271304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1182] Ignoring gpu device (device: 1, name: GP10B, pci bus id: 0000:00:00.0, compute capability: 6.2) with Cuda multiprocessor count: 2. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
2018-02-23 14:08:22.530561: I tensorflow/core/common_runtime/gpu/gpu_device.cc:859] Could not identify NUMA node of /job:localhost/replica:0/task:0/device:GPU:0, defaulting to 0. Your kernel may not have been built with NUMA support.
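The integrated GP10B gets skipped because it only exposes 2 multiprocessors; if we ever wanted tensorflow to consider it as well, the log itself names the knob to turn before launching Python:

# allow GPUs with as few as 2 multiprocessors (i.e. the integrated GP10B)
export TF_MIN_GPU_MULTIPROCESSOR_COUNT=2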
Looks hopeful…
In [1]: from scipy.io import wavfile
ds = Model('output_graph.pb',26,9,'alphabet.txt',500)
fs,wav = wavfile.read('test.wav')
ds.stt(wav,fs)
Out [1]: 'test'
and running tegrastats at the same time:
RAM 4642/6660MB (lfb 5x2MB) SWAP 811/8192MB (cached 67MB) cpu [10%@1991,0%@2034,0%@2035,7%@1992,5%@1995,8%@1993] EMC 0%@1600 GR3D 0%@1275 GR3D_PCI 98%@2607
So the discrete GPU (GR3D_PCI) is pegged and the CPU is nicely quiet. Finally!
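For completeness: assuming the wheel’s deepspeech command-line entry point landed on PATH and takes the v0.1.1-style positional arguments (model, audio, alphabet), a shell-level smoke test would look roughly like this:

# hypothetical CLI smoke test; argument order is an assumption based on the v0.1.1 client
deepspeech output_graph.pb test.wav alphabet.txt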