Prebuilt DeepSpeech binary for TensorFlow Lite model on Raspberry Pi 3?

That's not really the kind of error I would have expected; it's inconsistent: if you have tools/arm_compiler/linaro-gcc72-armeabi/BUILD, then there's no reason it would not find the ARMv7 toolchain.

The only thing I can think of is weird state. Please make sure you start from a clean environment, use our TensorFlow fork, check out the proper documented branch, use the upstream-documented Bazel version, and avoid working as root.
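For reference, a minimal sketch of that setup, assuming the Mozilla TensorFlow fork and the r1.13 branch discussed in this thread (adjust paths to your own workspace):

git clone https://github.com/mozilla/tensorflow.git
cd tensorflow
git checkout origin/r1.13   # the documented branch for this release
bazel version               # compare against the version documented upstream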

I'm quite sure the environment is clean and it's your TensorFlow fork, checked out at origin/r1.13, using Bazel 0.21.0, which should be compatible with that TensorFlow version. If the error isn't caused by something I missed, it's weird.

TensorFlow CI uses Bazel 0.19.2. Could you just make sure you clean any Bazel cache, run configure from scratch, and don't work as root?
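For example, something like this, run as a regular (non-root) user:

bazel clean --expunge    # drop all cached build outputs
rm -rf ~/.cache/bazel    # default Bazel output root on Linux; adjust if yours differs
./configure              # re-run TensorFlow's configure from scratch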

Thanks a lot! Bazel 0.19.2 works.
Now, about the small hack you mentioned above to build a TensorFlow Lite-enabled binary: could you give me some details?

Nope, because I have not kept notes :smiley:, but the build should progress well and then you should hit failures related to NEON. Ping me again when that happens, and we'll fix it publicly for others to try.


I managed to get this working on the DeepSpeech v0.5.1 tag with the corresponding TensorFlow 1.13 Mozilla fork the other day with a workaround, running it on a Raspberry Pi 4. I'm not within reach of my Pi 3 at the moment, although I would expect it to work there, too. It was markedly faster with TensorFlow Lite compared to the .pb and .pbmm. I did it on the v0.5.1 tag mainly so I could use the prebuilt .tflite model at v0.5.1 (and I'm looking forward to 0.6, given Google's On-Device Speech Recognizer and https://github.com/mozilla/DeepSpeech/pull/2307).

See below for the workaround.

It was late then, and now I'm on a flaky connection and without sufficient power, so I haven't gone back and verified whether the .cc files even need these edits (I had been working against those because they seemed the most likely culprit from what I could piece together originally). In any case, the BUILD file definitely needed to be touched up for this workaround (it's the stanza for the cc_library with name = "tensor_utils" in the BUILD below).

I did this from a pristine Ubuntu 18.04 Docker container, not the GPU-accelerated Dockerfile bundled with DeepSpeech (although I imagine that would work if you have an Nvidia GPU handy). By the way, here's the thing in action on a Pi 4: https://www.icloud.com/sharedalbum/#B0B5ON9t3uAsJR . Like I say, it was late. Binding this for Python, readying the model for execution ahead of time (instead of doing a full run top to bottom), and taking out the artificial delays in the dialogue would make it run a bit faster.
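A pristine container like that can be started with something along these lines (the container name and the mount are just illustrative):

docker run -it --name deepspeech-tflite-build \
  -v "$PWD":/work -w /work \
  ubuntu:18.04 /bin/bash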

Anyway, the diff.

(venv) root@d1d4b2c0eb07:/tensorflow# git diff
diff --git a/tensorflow/lite/kernels/internal/BUILD b/tensorflow/lite/kernels/internal/BUILD
index 4be3226938..7e2f66cc58 100644
--- a/tensorflow/lite/kernels/internal/BUILD
+++ b/tensorflow/lite/kernels/internal/BUILD
@@ -535,7 +535,7 @@ cc_library(
             ":neon_tensor_utils",
         ],
         "//conditions:default": [
-            ":portable_tensor_utils",
+            ":neon_tensor_utils",
         ],
     }),
 )
diff --git a/tensorflow/lite/kernels/internal/optimized/tensor_utils_impl.h b/tensorflow/lite/kernels/internal/optimized/tensor_utils_impl.h
index 8f52ef131d..780ae1da6c 100644
--- a/tensorflow/lite/kernels/internal/optimized/tensor_utils_impl.h
+++ b/tensorflow/lite/kernels/internal/optimized/tensor_utils_impl.h
@@ -24,9 +24,9 @@ limitations under the License.
 #endif
 
 #ifndef USE_NEON
-#if defined(__ARM_NEON__) || defined(__ARM_NEON)
+//#if defined(__ARM_NEON__) || defined(__ARM_NEON)
 #define USE_NEON
-#endif  //  defined(__ARM_NEON__) || defined(__ARM_NEON)
+//#endif  //  defined(__ARM_NEON__) || defined(__ARM_NEON)
 #endif  //  USE_NEON
 
 namespace tflite {
diff --git a/tensorflow/lite/kernels/internal/tensor_utils.cc b/tensorflow/lite/kernels/internal/tensor_utils.cc
index 701e5a66aa..21f2723c3b 100644
--- a/tensorflow/lite/kernels/internal/tensor_utils.cc
+++ b/tensorflow/lite/kernels/internal/tensor_utils.cc
@@ -16,9 +16,9 @@ limitations under the License.
 #include "tensorflow/lite/kernels/internal/common.h"
 
 #ifndef USE_NEON
-#if defined(__ARM_NEON__) || defined(__ARM_NEON)
+//#if defined(__ARM_NEON__) || defined(__ARM_NEON)
 #define USE_NEON
-#endif  //  defined(__ARM_NEON__) || defined(__ARM_NEON)
+//#endif  //  defined(__ARM_NEON__) || defined(__ARM_NEON)
 #endif  //  USE_NEON
 
 #ifdef USE_NEON
diff --git a/tensorflow/lite/tools/make/Makefile b/tensorflow/lite/tools/make/Makefile
index 994f660dba..805b699d23 100644
--- a/tensorflow/lite/tools/make/Makefile
+++ b/tensorflow/lite/tools/make/Makefile
@@ -1,3 +1,5 @@
+#!/bin/bash
+
 # Find where we're running from, so we can store generated files here.
 ifeq ($(origin MAKEFILE_DIR), undefined)
        MAKEFILE_DIR := $(shell dirname $(realpath $(lastword $(MAKEFILE_LIST))))

You should be able to bazel build TF and make the libraries.

bazel build --config=monolithic --config=rpi3 --config=rpi3_opt --define=runtime=tflite --config=noaws --config=nogcp --config=nohdfs --config=nokafka --config=noignite -c opt --copt=-O3 --copt=-fvisibility=hidden //native_client:libdeepspeech.so //native_client:generate_trie

In the DeepSpeech repo you may hit some errors and need to update the TFDIR path and the RASPBIAN path in native_client/definitions.mk. But you should be able to make the binaries.

make TARGET=rpi3 deepspeech
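If those paths do need overriding, they can also be passed straight on the make command line; the paths below are placeholders, not my actual layout:

make TARGET=rpi3 \
     TFDIR=/path/to/mozilla/tensorflow \
     RASPBIAN=/path/to/raspbian/rootfs \
     deepspeech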

@dr0ptp4kt that's really impressive, especially considering the limited power of the RPi.

Yes, that's nice, but nothing new to us. Even with the TFLite runtime on the RPi4 we are still unable to get a real-time factor close to 1. Since TFLite on those platforms requires build-system hacks, we have decided to hold off on shipping that feature. As documented here, it works.

What I’m seeing on the RPi4 with native_client/python/client.py for TFLite is inference at 3.503s for the 3.966s arctic_a0024.wav file.

I for one would be most thankful if the build were available, but I'm just one person. I was thinking of maybe making a fresh Dockerfile to help people reproduce the build. Would it be okay if I posted something like that to the main repo, or would you recommend maintaining a fork instead?

Is the hope to drive the inference time down to subsecond or thereabouts (or near-instantaneous, as with the Edge TPU)? In the context of a conversational agent, there are all kinds of UX hacks that can compensate for a few seconds of waiting, but I gather you were hoping to provide nearer-realtime user feedback (which is of course more important in a number of other contexts).

That's going to be a waste of your time, because we won't take it.

This is not what I'm seeing; can you document your context in more detail?

Is it possible there have been updates to the RPi4 firmware/bootloader that improve performance?
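A quick way to check that on a Pi 4, assuming Raspbian with the rpi-eeprom tooling installed:

vcgencmd bootloader_version   # current bootloader EEPROM build
sudo rpi-eeprom-update        # reports whether a newer bootloader is available
uname -a                      # kernel version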

The hope is to get the transcription process to run faster than the audio comes in.

Yes, but shipping that involves a non-trivial amount of work, and we have a lot of other things to take care of at the moment. So we need a good incentive to enable it.

The other alternative is moving all platforms (except CUDA) to TFLite runtime. But again, that’s non-trivial.

You can help by doing what you did: experimenting, giving feedback and documenting it.

@dr0ptp4kt The other alternative is swapping the TensorFlow runtime for the TFLite runtime on ARMv7 and AArch64 platforms. That involves less work, but it requires a few patches (yours is okay, but it's not the best way to handle it).

Okay, here's the context. It would be cool to have TFLite as the basis for those architectures, I agree! I was a little confused by the build system (even though it's rather well done; nice work!), but if you'd like, I'd be happy to try posting some patches. I think this will work on the 0.6 branch as well, and I reckon the Linaro cross-compilation could be optimized for Buster, but anyway here's the 0.5.1 version.

Thanks, but the problem is not making those patches; I have had TFLite builds locally for months. It's just a matter of making the decision: balancing the cost of maintaining extra patches to the build system against the speedup win.

Your work sounds like a great start, thanks! Does it use OpenCL or the CPU? Just asking to know how much margin for optimization there might be.

You should not hope to use OpenCL on the RPi. I worked on that for weeks last year to test the status, and while the driver was (and still is) in good development shape, our model was too complicated for it, and neither the maintainer nor I could find time to start working on the blocking items.

I would really like you to share more context, because I'm still not able to reproduce. This is on a RPi4, reinstalled just now, with the ICE Tower fan + heat spreader:

pi@raspberrypi:~/ds $ for f in audio/*.wav; do echo $f; mediainfo $f | grep Duration; done;
audio/2830-3980-0043.wav
Duration                                 : 1 s 975 ms
Duration                                 : 1 s 975 ms
audio/4507-16021-0012.wav
Duration                                 : 2 s 735 ms
Duration                                 : 2 s 735 ms
audio/8455-210777-0068.wav
Duration                                 : 2 s 590 ms
Duration                                 : 2 s 590 ms
pi@raspberrypi:~/ds $ ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.5-59-ga8a7af05
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=3.24553
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=2.38253
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=3.23032

So it's consistent with the previous builds I did. Can you @dr0ptp4kt give more context on what you're doing? How do you build and measure?

Hi @lissyx! When I’m referring to inferencing, I’m talking about the inferencing-specific portion of the run with client.py.

I noticed that the GitHub link I posted looked like a simple fork link, but here’s the specific README.md that shows what I did:

https://github.com/dr0ptp4kt/DeepSpeech/blob/tflite-rpi-3and4-compat/rpi3and4/README.md

The LM, even from a warmed-up filesystem cache, takes 1.28s to load on this 4 GB RAM Pi 4. When that's subtracted from the total run, it makes a significant percentage-wise difference. In an end-user application context, what I'd do is have the LM pre-loaded before the intake of voice data, so that the only thing the client has to do is the inferencing. Of course a 1.8 GB LM isn't going to fit into RAM on a device with 1 GB of RAM, so there I think the only good option is to tune the size (and therefore quality) of the LM, TRIE, and .tflite model files to the use case.
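As a crude illustration of that pre-loading idea, assuming a tool like vmtouch is installed, the LM and TRIE can be pulled into the page cache before any audio arrives, so the client run itself only pays for inference:

vmtouch -t deepspeech-0.5.1-models/lm.binary deepspeech-0.5.1-models/trie   # touch pages into cache
vmtouch -v deepspeech-0.5.1-models/lm.binary                                # report how much is resident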

I'm not telling you anything new here, but it's also possible to offload error correction to the retrieval system. In my Wikipedia use case I might be content, for lower-RAM scenarios, to forgo or dramatically shrink the LM and TRIE, increase the size of the .tflite model for greater precision (because there would still be RAM to spare), and use some sort of forgiving topic-embedding / fuzzy-matching scheme in the retrieval system, effectively moving part of the problem to the later stage. It's of course possible to move those improvements into the speech recognition run with DeepSpeech itself, but in the context of this binary, it's about managing RAM in stages so that the LM and TRIE don't spill over and page to disk.

Anyway, it looks like your run and my run are pretty close in terms of general speed: each clip takes roughly as long to process as its own duration (and the inference-specific part seems to take less time than that).

For your product roadmap, is the hope to be as fast as the incoming audio for realtime processing, or something of that nature? How much optimization do you want? I'm really interested in helping with that (through raw algorithms and smart hacks on the LM / TRIE / .tflite) or even with build-system work if you're open to it. I also know you need to manage the product roadmap, so I don't want to be too imposing!

Keep up the great work! If it would work for you I’d be happy to discuss on video (or Freenode if you prefer).

I’m running without LM.

Ok, can you try with deepspeech C++ binary and the -t command line argument?

Those are mmap()'d, so it’s not really a big issue.

What do you mean?

Here’s what I’m seeing with -t. Funny I missed the flag earlier :stuck_out_tongue:

Using the LM:

$ ./deepspeech --model deepspeech-0.5.1-models/output_graph.tflite --alphabet deepspeech-0.5.1-models/alphabet.txt --lm deepspeech-0.5.1-models/lm.binary --trie deepspeech-0.5.1-models/trie --audio arctic_a0024.wav -t

TensorFlow: v1.13.1-13-g174b4760eb

DeepSpeech: v0.5.1-0-g4b29b78

it was my reports from the north which chiefly induced people to buy

cpu_time_overall=3.25151

Not using the LM:

$ ./deepspeech --model deepspeech-0.5.1-models/output_graph.tflite --alphabet deepspeech-0.5.1-models/alphabet.txt --audio arctic_a0024.wav -t
TensorFlow: v1.13.1-13-g174b4760eb
DeepSpeech: v0.5.1-0-g4b29b78
it was my reports from the northwhich chiefly induced people to buy
cpu_time_overall=6.95059

So part of the speedup is definitely down to actual use of the LM.

I agree with you that mmap'ing the .tflite diminishes the negative effect of disk reads. As for the LM, it's definitely faster when loaded into RAM. Are you sure it's being consumed in an mmap'd fashion? I know it should be possible to mmap the read, of course, but that thing seems to take some 40s on the initial run, which is longer than I would expect if it were doing filesystem segment seeks in an mmap fashion. Maybe the 40s on the first read is just because the client fully consumes the file, whereas it could be made to only consume the pointer…I haven't dug into that part of the code beyond a quick scan. Gotta run, but I'm interested to hear if you have tips.
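One rough way to check, assuming strace is available, is to trace the file calls and see whether lm.binary's descriptor ends up in mmap() calls or in a long series of read()s:

strace -f -e trace=openat,mmap,read -o trace.log \
  ./deepspeech --model deepspeech-0.5.1-models/output_graph.tflite \
               --alphabet deepspeech-0.5.1-models/alphabet.txt \
               --lm deepspeech-0.5.1-models/lm.binary \
               --trie deepspeech-0.5.1-models/trie \
               --audio arctic_a0024.wav
grep -n 'lm.binary' trace.log   # find the fd, then follow it through the trace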

For the product roadmap, I mainly just wanted to ensure that if I post patches they'd be valuable to you and the general user base of DeepSpeech. I know it's an open source project and I'm free to fork, but I was hoping to work on problems whose solutions are mutually beneficial; I reckon the last thing you need is patches that aren't aligned with where you're taking this software. Specifically, I was wondering how much optimization you want in this RPi4 context. I was thinking that, if it would be helpful, I might post patches to reach the level of optimization you're hoping for. As for the build system, I'd also be happy to help with build scripts and that sort of thing (e.g., Bazel work, cutting differently sized versions of models, etc.). I'm not sure whether you'd need me to get shell access and do the requisite paperwork for that, or whether that's off limits or just not helpful; I can appreciate just how hard build and deploy pipelines are. I realize TaskCluster sort of runs on arbitrary systems, but I don't have a multi-GPU machine, so much of the full build pipeline, and even assumptions as simple as keystores, tends to break down on my local dev rig.

I'm unsure exactly what you are suggesting here. You reported much faster inference than what we can achieve, so I'm trying to understand. Getting TFLite to run is not a problem; I've been experimenting with that for months now, so I know how to do it.

Maybe, but that’s not really what we are concerned about for now.

Could you please reproduce that with current master and using our audio files?

Also, could you please document your base system? Raspbian? What's your PSU?
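For example, capturing something like this alongside the timings would help (Raspbian assumed):

cat /etc/os-release      # base system and version
uname -a                 # kernel
vcgencmd get_throttled   # non-zero bits indicate under-voltage or throttling, which usually points at the PSU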