ARM native_client with GPU support

@saikishor You might need to ensure you have some older GCC (4.9 is what I use) and set this env var when running configure: GCC_HOST_COMPILER_PATH=/usr/bin/gcc-4.9
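For example, a minimal sketch (assuming gcc-4.9 is installed at /usr/bin/gcc-4.9 and you run this from the tensorflow checkout):

cd tensorflow
GCC_HOST_COMPILER_PATH=/usr/bin/gcc-4.9 ./configure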


Thanks a lot @lissyx, your guidelines helped me build a .whl file. I was able to successfully install the GPU version on my NVIDIA TX2.
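For reference, installing it was just a pip install of the built wheel (the filename below is illustrative; yours will vary with the build):

pip install deepspeech_gpu-0.1.1-cp27-cp27mu-linux_aarch64.whl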

I tried with gcc (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609 and it worked in my case.

@elpimous_robot @gvoysey @lissyx I don’t know whether you faced this issue or not. The inference time on the TX2 is very long, and sometimes it even crashes. Do you have any idea how to fix this issue?

Successful Inference:

nvidia@tegra-ubuntu:~/DeepSpeech$ deepspeech models/output_graph.pb models/alphabet.txt data/Speech_test_data/can_you_find_a_seat_for_me.wav 
Loading model from file models/output_graph.pb
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-02-27 13:51:44.790188: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:881] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2018-02-27 13:51:44.790334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 5.14GiB
2018-02-27 13:51:44.790440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2018-02-27 13:51:45.309046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:859] Could not identify NUMA node of /job:localhost/replica:0/task:0/device:GPU:0, defaulting to 0.  Your kernel may not have been built with NUMA support.
Loaded model in 4.254s.
Running inference.
can you find his seat for me
Inference took 22.885s for 4.000s audio file.

Incomplete Inference:

nvidia@tegra-ubuntu:~/DeepSpeech$ deepspeech models/output_graph.pb models/alphabet.txt data/Speech_test_data/get_me_a_glass_of_water.wav 
Loading model from file models/output_graph.pb
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-02-27 13:52:50.278300: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:881] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2018-02-27 13:52:50.278444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 4.75GiB
2018-02-27 13:52:50.278518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2018-02-27 13:52:50.782133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:859] Could not identify NUMA node of /job:localhost/replica:0/task:0/device:GPU:0, defaulting to 0.  Your kernel may not have been built with NUMA support.
Loaded model in 3.266s.
Running inference.
2018-02-27 13:53:12.079776: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED
2018-02-27 13:53:12.080044: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1
Aborted (core dumped)


nvidia@tegra-ubuntu:~/DeepSpeech$ deepspeech models/output_graph.pb models/alphabet.txt models/lm.binary models/trie data/Speech_test_data/get_me_a_glass_of_water_sampled.wav 
Loading model from file models/output_graph.pb
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-02-27 14:31:22.311143: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:881] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2018-02-27 14:31:22.311272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 2.30GiB
2018-02-27 14:31:22.311360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2018-02-27 14:31:22.801195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:859] Could not identify NUMA node of /job:localhost/replica:0/task:0/device:GPU:0, defaulting to 0.  Your kernel may not have been built with NUMA support.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
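(Side note: the “Warning: reading entire model file into memory” message above can be acted on with TensorFlow’s convert_graphdef_memmapped_format tool, which may also ease the memory pressure behind the std::bad_alloc. A sketch, assuming it is built from the same mozilla/tensorflow checkout:

bazel build -c opt //tensorflow/contrib/util:convert_graphdef_memmapped_format
bazel-bin/tensorflow/contrib/util/convert_graphdef_memmapped_format --in_graph=models/output_graph.pb --out_graph=models/output_graph.pbmm

The resulting output_graph.pbmm would then be passed to the client in place of output_graph.pb.)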

I had some issues on ARMv6 with the RPi3, but that was when experimenting with funny stuff; the default was working. I have no idea what is going on in your case, and I don’t have time to investigate.

OK, thanks for the reply. Let me wait for replies from the other two…

Worst case (it’s going to be slow), try a TensorFlow debug build (-c dbg instead of -c opt on the command line) and then run the C++ client under gdb to see where it is breaking. There might be a legitimate issue.
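Something along these lines (a sketch; the bazel target is the libdeepspeech one from native_client, and the client path is illustrative):

bazel build -s -c dbg --config=cuda //native_client:libdeepspeech.so
gdb --args ./deepspeech models/output_graph.pb models/alphabet.txt test.wav
(gdb) run
(gdb) bt    # once it aborts, the backtrace shows where it breaks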


Hi,

I am trying to build the native_client on an NVIDIA TX2 with the following configuration:

  • JetPack 3.2
  • CUDA 9.0
  • cuDNN 7.0
  • gcc (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
  • tensorflow (from mozilla) at commit: ad8f785459e80823a2ff4456eeb9d7220c33b9c6
  • DeepSpeech at commit: 3c546d50059d468ea199814d77bac4ea97b5ee57

and when I am running:

bazel build -s -c opt --copt=-O3 --config=cuda //native_client:libctc_decoder_with_kenlm.so

I am getting this error:

SUBCOMMAND: # @protobuf_archive//:protobuf_lite [action 'Compiling external/protobuf_archive/src/google/protobuf/message_lite.cc [for host]']
(cd /home/nvidia/data_1/bazel_cache/_bazel_nvidia/6b9138338a6a5d153417b602388184c1/execroot/org_tensorflow && \
  exec env - \
    LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64: \
    PATH=/usr/local/cuda-9.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin \
    PWD=/proc/self/cwd \
  external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 -DNDEBUG -ffunction-sections -fdata-sections '-std=c++11' -MD -MF bazel-out/host/bin/external/protobuf_archive/_objs/protobuf_lite/external/protobuf_archive/src/google/protobuf/message_lite.d '-frandom-seed=bazel-out/host/bin/external/protobuf_archive/_objs/protobuf_lite/external/protobuf_archive/src/google/protobuf/message_lite.o' -iquote external/protobuf_archive -iquote bazel-out/host/genfiles/external/protobuf_archive -iquote external/bazel_tools -iquote bazel-out/host/genfiles/external/bazel_tools -isystem external/protobuf_archive/src -isystem bazel-out/host/genfiles/external/protobuf_archive/src -isystem bazel-out/host/bin/external/protobuf_archive/src -g0 -DGEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK -g0 -DHAVE_PTHREAD -Wall -Wwrite-strings -Woverloaded-virtual -Wno-sign-compare -Wno-unused-function -no-canonical-prefixes -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -fno-canonical-system-headers -c external/protobuf_archive/src/google/protobuf/message_lite.cc -o bazel-out/host/bin/external/protobuf_archive/_objs/protobuf_lite/external/protobuf_archive/src/google/protobuf/message_lite.o)
ERROR: /home/nvidia/data_1/bazel_cache/_bazel_nvidia/6b9138338a6a5d153417b602388184c1/external/jpeg/BUILD:225:1: C++ compilation of rule '@jpeg//:simd_armv7a' failed (Exit 1)
gcc: error: unrecognized command line option '-mfloat-abi=softfp'
Target //native_client:libctc_decoder_with_kenlm.so failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 19.744s, Critical Path: 7.05s
INFO: 36 processes: 36 local.
FAILED: Build did NOT complete successfully

Could you please give me an idea of how to move forward? I understand that @elpimous_robot has more experience with DeepSpeech on the TX2, so @elpimous_robot, do you have some tips for me?

Many thanks in advance!

Okay, so first, why ad8f785459e80823a2ff4456eeb9d7220c33b9c6? Please use the r1.6 branch, which is currently at https://github.com/mozilla/tensorflow/commit/50214731ea43f41ee036ce9af0c0c4a10185fc8f
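For example (assuming a clone of mozilla/tensorflow with the default remote):

cd tensorflow
git fetch origin
git checkout r1.6
git rev-parse HEAD    # should print 50214731ea43f41ee036ce9af0c0c4a10185fc8f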

No particular reason for the ad8f785459e80823a2ff4456eeb9d7220c33b9c6 commit.

I switched to r1.6 and I get a similar error with respect to -mfloat-abi=softfp (a 32-bit ARM option that the aarch64 gcc does not recognize):

SUBCOMMAND: # @jpeg//:jpeg [action 'Compiling external/jpeg/jquant2.c']
(cd /home/nvidia/data_1/bazel_cache/_bazel_nvidia/6b9138338a6a5d153417b602388184c1/execroot/org_tensorflow && \
  exec env - \
    CUDA_TOOLKIT_PATH=/usr/local/cuda \
    CUDNN_INSTALL_PATH=/usr/lib/aarch64-linux-gnu \
    GCC_HOST_COMPILER_PATH=/usr/bin/gcc \
    LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64: \
    PATH=/usr/local/cuda-9.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin \
    PWD=/proc/self/cwd \
    PYTHON_BIN_PATH=/usr/bin/python \
    PYTHON_LIB_PATH=/usr/local/lib/python2.7/dist-packages \
    TF_CUDA_CLANG=0 \
    TF_CUDA_COMPUTE_CAPABILITIES=3.0,3.5,3.7,5.2,5.3,6.0,6.1,6.2 \
    TF_CUDA_VERSION=9.0 \
    TF_CUDNN_VERSION=7.0.5 \
    TF_NEED_CUDA=1 \
    TF_NEED_OPENCL_SYCL=0 \
  external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 -DNDEBUG -ffunction-sections -fdata-sections -MD -MF bazel-out/arm-opt/bin/external/jpeg/_objs/jpeg/external/jpeg/jquant2.pic.d -fPIC -iquote external/jpeg -iquote bazel-out/arm-opt/genfiles/external/jpeg -iquote external/bazel_tools -iquote bazel-out/arm-opt/genfiles/external/bazel_tools -DGEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK -O3 -O3 -w -D__ARM_NEON__ '-march=armv7-a' '-mfloat-abi=softfp' -fprefetch-loop-arrays -no-canonical-prefixes -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -fno-canonical-system-headers -c external/jpeg/jquant2.c -o bazel-out/arm-opt/bin/external/jpeg/_objs/jpeg/external/jpeg/jquant2.pic.o)
ERROR: /home/nvidia/data_1/bazel_cache/_bazel_nvidia/6b9138338a6a5d153417b602388184c1/external/jpeg/BUILD:44:1: C++ compilation of rule '@jpeg//:jpeg' failed (Exit 1)
gcc: error: unrecognized command line option '-mfloat-abi=softfp'
Target //native_client:libctc_decoder_with_kenlm.so failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 79.330s, Critical Path: 30.16s
INFO: 113 processes: 113 local.
FAILED: Build did NOT complete successfully

Can you include your Bazel command line? And the Bazel version? Also, ensure you don’t have a stale Bazel cache.

Sure, these are the commands I am running:

bazel clean
bazel build -s -c opt --copt=-O3 --config=cuda //native_client:libctc_decoder_with_kenlm.so

and I still get the error about ‘-mfloat-abi=softfp’:

SUBCOMMAND: # @jpeg//:jpeg [action 'Compiling external/jpeg/jquant2.c']
(cd /home/nvidia/data_1/bazel_cache/_bazel_nvidia/6b9138338a6a5d153417b602388184c1/execroot/org_tensorflow && \
  exec env - \
    CUDA_TOOLKIT_PATH=/usr/local/cuda \
    CUDNN_INSTALL_PATH=/usr/lib/aarch64-linux-gnu \
    GCC_HOST_COMPILER_PATH=/usr/bin/gcc \
    LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64: \
    PATH=/usr/local/cuda-9.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin \
    PWD=/proc/self/cwd \
    PYTHON_BIN_PATH=/usr/bin/python \
    PYTHON_LIB_PATH=/usr/local/lib/python2.7/dist-packages \
    TF_CUDA_CLANG=0 \
    TF_CUDA_COMPUTE_CAPABILITIES=3.0,3.5,3.7,5.2,5.3,6.0,6.1,6.2 \
    TF_CUDA_VERSION=9.0 \
    TF_CUDNN_VERSION=7.0.5 \
    TF_NEED_CUDA=1 \
    TF_NEED_OPENCL_SYCL=0 \
  external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 -DNDEBUG -ffunction-sections -fdata-sections -MD -MF bazel-out/arm-opt/bin/external/jpeg/_objs/jpeg/external/jpeg/jquant2.pic.d -fPIC -iquote external/jpeg -iquote bazel-out/arm-opt/genfiles/external/jpeg -iquote external/bazel_tools -iquote bazel-out/arm-opt/genfiles/external/bazel_tools -DGEMMLOWP_ALLOW_SLOW_SCALAR_FALLBACK -O3 -O3 -w -D__ARM_NEON__ '-march=armv7-a' '-mfloat-abi=softfp' -fprefetch-loop-arrays -no-canonical-prefixes -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -fno-canonical-system-headers -c external/jpeg/jquant2.c -o bazel-out/arm-opt/bin/external/jpeg/_objs/jpeg/external/jpeg/jquant2.pic.o)
ERROR: /home/nvidia/data_1/bazel_cache/_bazel_nvidia/6b9138338a6a5d153417b602388184c1/external/jpeg/BUILD:44:1: C++ compilation of rule '@jpeg//:jpeg' failed (Exit 1)
gcc: error: unrecognized command line option '-mfloat-abi=softfp'
Target //native_client:libctc_decoder_with_kenlm.so failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 499.665s, Critical Path: 101.23s
INFO: 518 processes: 518 local.
FAILED: Build did NOT complete successfully

bazel version
Build label: 0.15.2- (@non-git)
Build target: bazel-out/arm-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Wed Jul 18 09:16:28 2018 (1531905388)
Build timestamp: 1531905388
Build timestamp as int: 1531905388

Please use Bazel 0.10 as TensorFlow recommends for r1.6. Also make sure to nuke /home/nvidia/data_1/bazel_cache/.
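For example (the cache path is the one from your logs):

bazel clean --expunge
rm -rf /home/nvidia/data_1/bazel_cache/
bazel version    # should report 0.10.x once downgraded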

Hi, I wrote a HOWTO proposing a solution that worked for me!

Hope it will help you
Vincent

Your solution works like a charm for me too!

I very much appreciate @lissyx’s and @elpimous_robot’s responsiveness!
