DeepSpeech native client compilation for the Asus Tinker Board

Hi,

I am trying to run the native client on an Asus Tinker Board, a board with an architecture similar to the Raspberry Pi 3 (armv7l, 32-bit).
But I am a bit stuck now.

The steps I followed are:

  1. Create a clean OS SD card with TinkerOS (Debian), install Miniconda3 (because some Python packages are available there without compilation), and create a conda environment deep-speech with Python 2.7.

  2. Install DeepSpeech with the instructions from README.md, except for TensorFlow, which has to be compiled because no package is available in either pip or conda, and in any case I need to compile it for the native client.
    Obviously the native_client download from TaskCluster does not work, because it is the Linux 64-bit one.

  3. Compile Bazel and TensorFlow from scratch with these instructions:
    https://github.com/samjabrahams/tensorflow-on-raspberry-pi/blob/master/GUIDE.md
    WARNING: the TensorFlow code is retrieved from mozilla/tensorflow, not from the upstream TensorFlow repository

  4. Compile DeepSpeech native_client with the instructions here (not language bindings, just custom decoder):
    https://github.com/mozilla/DeepSpeech/blob/23c8dcffcf9337c394301d2756976b234729cc9b/native_client/README.md
    NOTE: these steps were performed on both boards, the Tinker Board and the Raspberry Pi 3

  5. Finally, try to run a pretrained toy Spanish model (that I have used successfully on my Mac before) with the native client and some test WAV files.
    The same error appears on both boards, RPi3 and Tinker Board:
    Invalid argument: No OpKernel was registered to support Op 'SparseToDense' with these attrs. Registered devices: [CPU], Registered kernels:
    device='CPU'; T in [DT_STRING]; Tindices in [DT_INT64]
    device='CPU'; T in [DT_STRING]; Tindices in [DT_INT32]
    device='CPU'; T in [DT_BOOL]; Tindices in [DT_INT64]
    device='CPU'; T in [DT_BOOL]; Tindices in [DT_INT32]
    device='CPU'; T in [DT_FLOAT]; Tindices in [DT_INT64]
    device='CPU'; T in [DT_FLOAT]; Tindices in [DT_INT32]
    device='CPU'; T in [DT_INT32]; Tindices in [DT_INT64]
    device='CPU'; T in [DT_INT32]; Tindices in [DT_INT32]
    [[Node: SparseToDense = SparseToDense[T=DT_INT64, Tindices=DT_INT64, validate_indices=true](CTCBeamSearchDecoder, CTCBeamSearchDecoder:2, CTCBeamSearchDecoder:1, SparseToDense/default_value)]]

  6. I found the post "Error with sample model on Raspbian Jessie"
    and downloaded the precompiled Raspberry Pi libraries from here: https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.master.arm/artifacts/public/native_client.tar.xz
    Those libraries do not include libctc_decoder_with_kenlm.so, so I kept the one I had compiled.

  7. With the Raspberry Pi libraries, the model works FINE on the Raspberry Pi board :-), but the Tinker Board throws a new error:
    Thread 1 "deepspeech" received signal SIGILL, Illegal instruction.
    0xb692de84 in tensorflow::(anonymous namespace)::GraphConstructor::TryImport() () from /home/ftx/fonotexto/herramientas/DeepSpeech/libdeepspeech.so

  8. I have run out of ideas, so I am posting this question in the hope of getting any new hint that can unblock me.

This is an overview of the story so far; if you need additional details, let me know.
Thanks a lot for your help,
Mar

Just use our tooling. I know nothing about https://github.com/samjabrahams/tensorflow-on-raspberry-pi/blob/master/GUIDE.md, but obviously it's not good.

From our mozilla/tensorflow checkout, use the r1.5 branch together with master for mozilla/DeepSpeech. To build for RPi3, just use --config=rpi3 on the bazel build command line.

If you don't use --config=rpi3, the build will not pick up Bazel's RPi3 toolchain definition, which includes the -DRASPBERRY_PI flag needed for SparseToDense and other ops to behave properly.
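
For example, a cross-compilation run from a Linux host might look like this (a sketch only; it assumes the DeepSpeech checkout sits next to the tensorflow one, and reuses the flags and targets from the native_client README):

git clone -b r1.5 https://github.com/mozilla/tensorflow
cd tensorflow
./configure
ln -s ../DeepSpeech/native_client ./
bazel build --config=monolithic --config=rpi3 -c opt --copt=-O3 --copt=-fvisibility=hidden //native_client:libdeepspeech.so //native_client:deepspeech_utils //native_client:libctc_decoder_with_kenlm.so //native_client:generate_trie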

Hi again,

The TensorFlow sources I am using are these:
https://github.com/mozilla/tensorflow/
Should I use these instead?
https://github.com/mozilla/tensorflow/tree/r1.5

I have tried to follow your hint, without success :-(.

  1. First of all, I tried to compile on the Tinker Board itself with the --config=rpi3 parameter:

bazel build --config=monolithic --config=rpi3 -c opt --copt=-O3 --copt=-fvisibility=hidden //native_client:libdeepspeech.so //native_client:deepspeech_utils //native_client:libctc_decoder_with_kenlm.so //native_client:generate_trie

and ended up with this error:

tools/arm_compiler/gcc_arm_rpi/arm-linux-gnueabihf-gcc: line 3: /proc/self/cwd/external/GccArmRpi/arm-bcm2708/arm-rpi-4.9.3-linux-gnueabihf/bin/arm-linux-gnueabihf-gcc: cannot execute binary file: Exec format error

This happens because the TensorFlow toolchain downloads a 64-bit (x86) compiler to do the job, so I assumed that it is designed for cross-compiling.

  2. Then I tried to cross-compile from my Mac using --config=rpi3:

bazel build --config=monolithic -c opt --copt=-O3 --config=rpi3 --copt=-fvisibility=hidden //native_client:libdeepspeech.so //native_client:deepspeech_utils //native_client:libctc_decoder_with_kenlm.so //native_client:generate_trie

The process finished OK, but the generated binaries are 64-bit as well, so they cannot be used.

So, any other hint? Maybe I should set something special at the TensorFlow ./configure prompt that asks for the optimization flags to use "when --config=opt is specified [Default is -march=native]"?

I would like to compile DeepSpeech on the Tinker Board itself, because your precompiled Raspberry Pi libraries give the "Illegal instruction" error.
Is there any alternative way of compiling without the Bazel toolchain, just using make?

Thanks,
Mar

If you compile on the board itself, it's going to be slow, but it should work: don't use --config=rpi3 if you want to do that.

You are right that this is cross-compilation, but it is designed and tested only from a Linux host, not from a Mac: the downloaded toolchain consists of Linux binaries (the official RPi Foundation toolchain), and I have no idea what you might get outside of that.

But cross-compiling with --config=rpi3 will get you (hopefully) the same binaries, so it might be failing the same way. Ideally you should add a cross-compilation target in tools/arm_compiler/CROSSTOOL.

Looking at the specs (https://en.wikipedia.org/wiki/Asus_Tinker_Board), it's not that close to the RPi3, which is an ARM Cortex-A53 while the Asus is an RK3288 Cortex-A17. So I don't really understand how you might expect RPi3 binaries to run by default :).

As a quick hack, you can change the compiler_flag sections in the above CROSSTOOL file, in the toolchain section identified by gcc_rpi_linux_armhf, and specifically the -mtune and -mfpu values, to adapt them to your system. I'm unsure whether the GCC provided with the RPi3 toolchain will be good enough in your case.
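
For the RK3288 (Cortex-A17, which has NEON-VFPv4), that would mean something along these lines (hypothetical values; check what the existing gcc_rpi_linux_armhf entries actually contain before editing):

compiler_flag: "-mtune=cortex-a17"
compiler_flag: "-mfpu=neon-vfpv4"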

If that's not working, then your best (but sloooooow) option is in-situ compilation, but just DON'T add --config=rpi3 there.
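
The in-situ invocation would then simply be the same bazel command you used above, minus --config=rpi3, e.g.:

bazel build --config=monolithic -c opt --copt=-O3 --copt=-fvisibility=hidden //native_client:libdeepspeech.so //native_client:deepspeech_utils //native_client:libctc_decoder_with_kenlm.so //native_client:generate_trie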

Yes, I was very naive in thinking your precompiled Raspberry Pi libraries could work on this architecture :frowning:
I will try to follow these new instructions. Thanks a lot.

Since we explicitly target the Raspbian distro and the Cortex-A53 architecture of the RPi3, it would in fact be really surprising if it worked.

There might be a third solution: using the TensorFlow cross-compilation bits that they landed after I did mine, but I have not explored precisely how they work, so I cannot recommend or guide you on that for now.

I'd really suggest sticking to cross-compilation, though, since it's much, much faster; even if you have to trial-and-error, in the end it's likely to be faster to get something working. In-situ compilation will strain your memory and (might) require extra setup for swap, etc.

Hi,

I finally did the compilation on the board itself, without Bazel, using make.

  1. Set up a swap area on a pendrive (3 GB) to speed things up.

  2. Compile TensorFlow using these RPi3 makefile directives, but adding the extra parameter ANDROID_TYPES=-D__ANDROID_TYPES_FULL__ to the compilation line.

Yes this takes a long time :confounded: (about 1-2 hours) with -j4.

  3. Compile DeepSpeech (libraries and binary) using a custom makefile, following the Bazel instructions and the information in your BUILD file, and including the objects from the TensorFlow compilation.

And it works fine :slight_smile: .

Thanks a lot for your help!
Mar


Hello Mar,
I am currently working on a project where I am trying to use DeepSpeech on the Asus Tinker Board too. But as I am totally stuck and not really able to follow your instructions successfully, could you provide a more detailed guide to the whole process of installing the prerequisites and DeepSpeech itself on it?
I'm trying really hard to get into the whole thing, but I am just a little overwhelmed by the overall effort, I guess.

Nobody can help you if you don't explain what is blocking you. There is no build tailored for your board yet, but there should be ways to perform one.

Please have a look at the current codebase. I've moved away from the Raspberry Pi toolchain and decoupled generic ARM cross-compilation from the RPi3-specific bits. Check tools/bazel.rc: https://github.com/mozilla/tensorflow/blob/master/tools/bazel.rc#L52-L56; you can get inspiration from there, make your own build:X lines, and use them with bazel build --config=X.
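
For instance, a hypothetical build:tinkerboard entry could look like this (an illustration to adapt, not something that exists in the repo; it assumes the generic ARM lines from the linked bazel.rc can be reused and only the CPU tuning changes):

build:tinkerboard --config=arm
build:tinkerboard --copt=-DRASPBERRY_PI
build:tinkerboard --copt=-mtune=cortex-a17
build:tinkerboard --copt=-mfpu=neon-vfpv4

and then you would build with bazel build --config=tinkerboard plus the usual flags and targets.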

There should be no need for ANDROID_TYPES=-D__ANDROID_TYPES_FULL__; this is already taken care of by the use of --copt=-DRASPBERRY_PI.

That should save you the hassle of using makefiles.

Hi,

Are you trying to cross-compile as lissyx suggests, or to compile natively with GCC as I did?
It would be nice to know where exactly you are stuck.

Cross-compilation is a better and simpler idea, but in any case I can give you more detailed instructions:

  1. Deploy a Debian JESSIE OS distribution on the SD card; you need GCC 4.8, so DO NOT USE Debian STRETCH.

    http://dlcdnet.asus.com/pub/ASUS/mb/Linux/Tinker_Board_2GB/20170330-tinker-board-linaro-jessie-alip-v16.zip

    WARNING1: be careful with the system date and time on the Asus Tinker Board; configure it correctly before continuing
    WARNING2: do NOT execute "sudo apt upgrade"; libnettle6 will be installed and some transfer tools will stop working

  #install some prerequisites

sudo apt-get install apt-utils
sudo apt-get update
sudo apt-get install -y pkg-config
sudo apt-get install -y autoconf automake libtool gcc-4.8 g++-4.8
sudo apt-get install -y swig build-essential libssl-dev libffi-dev libgmp3-dev
sudo apt-get install -y openjdk-8-jdk libeigen3-dev
sudo apt-get install -y curl git wget gawk zip zlib1g-dev unzip
sudo apt-get install -y sox libsox-dev

  2. Create a swap area by inserting a pendrive with 3 GB or more of free space

sudo blkid            # get the device, e.g. /dev/sda1
sudo umount /dev/sda1
sudo mkswap /dev/sda1 # note the UUID it prints (UUIDcode)
sudo vi /etc/fstab    # insert a new line:
UUID=UUIDcode none swap sw,pri=5 0 0
sudo swapon -a

  3. Follow the TensorFlow makefile instructions to compile natively ON the Tinker Board itself

git clone https://github.com/mozilla/tensorflow
cd tensorflow

vi ./tensorflow/contrib/makefile/tf_op_files.txt
#add these op files so that they get compiled too, for DeepSpeech
tensorflow/core/ops/bitwise_ops.cc
tensorflow/core/ops/lookup_ops.cc
./tensorflow/contrib/makefile/download_dependencies.sh

  #protobuf compilation

cd tensorflow/contrib/makefile/downloads/protobuf/
./autogen.sh
./configure
make
sudo make install
sudo ldconfig # refresh shared library cache
cd -

  #nsync environment and compilation	(NOTE1)

export HOST_NSYNC_LIB=tensorflow/contrib/makefile/compile_nsync.sh
export TARGET_NSYNC_LIB="$HOST_NSYNC_LIB"
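
Note: what should end up in these variables is the path of the nsync.a that compile_nsync.sh builds (the script prints that path when run), so the intended form is command substitution, e.g.:

export HOST_NSYNC_LIB=$(tensorflow/contrib/makefile/compile_nsync.sh)
export TARGET_NSYNC_LIB="$HOST_NSYNC_LIB"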

  #tensorflow itself

make -j4 -f tensorflow/contrib/makefile/Makefile HOST_OS=PI TARGET=PI ANDROID_TYPES=-D__ANDROID_TYPES_FULL__ OPTFLAGS="-Os -mfpu=neon-vfpv4 -funsafe-math-optimizations -ftree-vectorize" CXX=g++-4.8
#(wait for 1-2 hours)

  #at the end you get the tensorflow objects here, 

tensorflow/contrib/makefile/downloads
tensorflow/contrib/makefile/gen
#including the libtensorflow-core.a static library with ALL the object code you need
tensorflow/contrib/makefile/gen/lib/libtensorflow-core.a

  #do the symlink as instructed by the Mozilla team

ln -s ../DeepSpeech/native_client ./
cd ..

  4. Finally, DeepSpeech

git clone https://github.com/mozilla/DeepSpeech
cd DeepSpeech

  #Now you have to manually modify the original Mozilla Makefile to add the TF libraries and includes to the compilation of the deepspeech native client binary; there are many ways to do it.
  #add these include paths (if I missed any, the compilation will fail and the error will guide you)
  	TFDIR ?= $(abspath $(MAKEFILE_DIR)/../../tensorflow) 
  	${TFDIR}/tensorflow/contrib/makefile/downloads
  	${TFDIR}/tensorflow/contrib/makefile/downloads/eigen
  	${TFDIR}/tensorflow/contrib/makefile/downloads/gemmlowp
  	${TFDIR}/tensorflow/contrib/makefile/downloads/nsync/public
  	${TFDIR}/tensorflow/contrib/makefile/downloads/fft2d
  	${TFDIR}/tensorflow/contrib/makefile/gen/proto
  	${TFDIR}/tensorflow/contrib/makefile/gen/host_obj
  	${TFDIR}/tensorflow/contrib/makefile/gen/protobuf-host/include
  #add these extra library paths; this is only an example, you can do it your own way:
  	TF_GENDIR := ${TFDIR}/tensorflow/contrib/makefile/gen
  	TF_LIBPATH := ${TFDIR}/native_client/gen/lib
  	TF_NSYNCPATH := ${TFDIR}/$(dir ${TARGET_NSYNC_LIB})
  	#in the compilation line:
  	-L$(TF_NSYNCPATH) -L$(TF_LIBPATH) -L$(MAKEFILE_DIR)
  #add these libraries to the deepspeech binary on the compilation line as well:
  	-lnsync -lstdc++ -lprotobuf -lz -lm -ldl -lpthread -ltensorflow-core

  #Then execute the compilation + link
  	make 

And then you will get the deepspeech binary, as well as generate_trie and the .so libraries.
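
As a quick sanity check (standard tools, not part of the DeepSpeech instructions), you can verify that the result is a 32-bit ARM binary and that its shared libraries resolve:

file deepspeech    # should report a 32-bit ARM ELF executable
ldd deepspeech     # lists the shared libraries the binary will load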

WARNING1: be careful to set the paths correctly to where the libraries are placed; you can use symbolic links or real paths, as you usually do in your own builds
WARNING2: before making DeepSpeech, be sure that the NSYNC environment variables are properly set in the same terminal (see (NOTE1))
WARNING3: libtensorflow-core.a is a HUGE library and may produce a binary that is too big; you can instead link against just the required individual .o files under tensorflow/contrib/makefile/gen to get a smaller deepspeech binary

That's all, I hope it helps.
Mar


Nice, but you should use --config=monolithic and the proper visibility flags as we do :slight_smile:

Thank you so very much! This added quite some perspective to my approach.

I followed your instructions successfully until

#tensorflow itself
make -j4 -f tensorflow/contrib/makefile/Makefile HOST_OS=PI TARGET=PI ANDROID_TYPES=-D__ANDROID_TYPES_FULL__ OPTFLAGS="-Os -mfpu=neon-vfpv4 -funsafe-math-optimizations -ftree-vectorize" CXX=g++-4.8

The process started running, but after a very short while it threw the following:

...
/home/linaro/tensorflow/tensorflow/contrib/makefile/gen/host_obj/tensorflow/core/example/feature.pb.o /home/linaro/tensorflow/tensorflow/contrib/makefile/gen/host_obj/tensorflow/core/example/example.pb.o /home/linaro/tensorflow/tensorflow/contrib/makefile/gen/host_obj/tensorflow/core/grappler/costs/op_performance_data.pb.o  -L/usr/local/lib tensorflow/contrib/makefile/compile_nsync.sh -lstdc++ -lprotobuf -lpthread -lm -lz -ldl -lpthread
tensorflow/contrib/makefile/compile_nsync.sh: file not recognized: File format not recognized
collect2: error: ld returned 1 exit status
tensorflow/contrib/makefile/Makefile:808: recipe for target '/home/linaro/tensorflow/tensorflow/contrib/makefile/gen/host_bin/proto_text' failed
make: *** [/home/linaro/tensorflow/tensorflow/contrib/makefile/gen/host_bin/proto_text] Error 1

Following the somewhat matching advice here, I ran compile_nsync.sh and replaced the exports with

export HOST_NSYNC_LIB=tensorflow/contrib/makefile/downloads/nsync/builds/default.linux.c++11/nsync.a
export TARGET_NSYNC_LIB="$HOST_NSYNC_LIB"

which seemingly made the whole process run through.

Now the last part of your instructions seems to be all Greek to me (but that might just be the time of day :smiley: ), so it will be on tomorrow's to-do list. I'll report back how it went once I get to it. Meanwhile, thank you again!

P.S.: I was not sure about the g+±4.8 in your instructions, but I replaced it with g++-4.8.
P.P.S.: And just now I notice that the conflation of + and - seems to be a feature of this editor, which can be masked out.

I would really suggest that you, @mar_martinez and @rps, cross-build using Bazel and follow our docs, because the way you are building and linking is not something we support, and it might lead to unexpected behavior.

Thanks for the suggestion and taking care!
At the moment I'm just trying to make it run in any way possible, to get an impression of the general performance achievable with this hardware. If I get more serious about it, I'll have a closer look at Bazel, which is still new to me at the moment.

Sure, but for example you are not using the same GCC version and optimization flags as we are, so your testing might be biased (in a good or a bad way) :frowning:

I understand, and that concerns me. We'll probably have some (naive) comparison on that soon.


I totally agree, and will move to cross-compilation as soon as I can


Thanks! I don't know if there would be any value in more "generic" ARMv7 binaries, instead of building with -mtune=cortex-a53 to match the RPi3.

FYI, I'm in the process of switching back to GCC 4.9 from Linaro, because somehow anything above GCC 5 crashes while running tests on ARMv7 and ARMv8 hardware, on NodeJS v8 and NodeJS v9. Much of my investigation this week turned up a lot of noise around recent GCC versions and C++11 ABI enablement leading to similar invalid-pointer errors. Since that is the only thing blocking me from moving on with merging RPi3 TaskCluster testing, I decided to switch the versions back.

One of the bugs I found was https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68073 and another is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82172. Now, I'm not 100% sure it's the same issue we face, but the thing is, the code crashes differently depending on whether I enable or disable that same macro.

I've tried with Valgrind on ARM64 hardware, and the C++ and Python code never hit any bad memory access. For some reason, Valgrind seems broken, at least on my RPi3 Raspbian Stretch install, but building with Linaro GCC 6.3 and AddressSanitizer, the same testing of the C++ client did not reveal any bad memory access either.

So, while I don't have hard proof, it seems close enough to consider that we might not be at fault here: likely it's neither our way of cross-compiling nor the DeepSpeech-specific code that is at fault.