Prebuilt DeepSpeech binary for a TensorFlow Lite model on Raspberry Pi 3?

Hi, I have exported my model to TensorFlow Lite format, but I’m stuck on inference because the prebuilt deepspeech binary only accepts .pb models.
I noticed that there is a USE_TFLITE flag to enable tflite model inference. Do I need to rebuild the binary? Could anybody give me some guidance?

So as of now, the build system only leverages TFLite on the Android platform. There’s nothing stopping you from rebuilding for other systems with TFLite enabled, except that it requires a few changes to the TensorFlow build system. That’s why, for now, we decided not to do it.

So, before rebuilding the library, have you tried exporting as a TensorFlow protobuf and not tflite? How does it work in your case?

Yes, I’ve tried the protobuf format model.
In my case (Chinese), I trained the model from scratch with the parameters used for the release model, except that I changed n_hidden to 512 to get a smaller model. (I’m still working on the accuracy, as it is below 70% on my test set.)

The tests on a Raspberry Pi 3B+ take about 10s on average to process 5s audio clips.
As I am trying to make the model faster on lightweight devices, I thought TFLite might help.
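
As a rough sketch (not taken from the DeepSpeech code base), those figures can be expressed as a real-time factor: processing time divided by audio duration, where a value below 1 means faster than real time. The function name is an illustrative assumption.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures quoted above: roughly 10 s of processing for a 5 s clip.
rtf = real_time_factor(10.0, 5.0)
print(f"RTF = {rtf:.2f}")  # prints "RTF = 2.00"
```

So the .pb model on this board runs at about twice the clip length.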

Close to what we see as well

Well, it can, but it needs some (small) hacking that I don’t have written down, so I can’t guide you as easily as I’d like. Do you have a build setup working properly? Are you able to reproduce a build that works?

Not really, I ran into some trouble when cross-building the rpi3 native client.
I followed the instructions here

But I got the errors below:

root@vultr:~/tensorflow# bazel build --config=monolithic --config=rpi3 --config=rpi3_opt -c opt --copt=-O3 --copt=-fvisibility=hidden // //native_client:generate_trie
INFO: Invocation ID: 26719d85-e7e1-46b0-923c-3269a6b1f298
ERROR: /root/tensorflow/tools/arm_compiler/linaro-gcc72-armeabi/BUILD:12:1: in cc_toolchain_suite rule //tools/arm_compiler/linaro-gcc72-armeabi:toolchain: cc_toolchain_suite '//tools/arm_compiler/linaro-gcc72-armeabi:toolchain' does not contain a toolchain for cpu 'armv7'
ERROR: Analysis of target '//native_client:generate_trie' failed; build aborted: Analysis of target '//tools/arm_compiler/linaro-gcc72-armeabi:toolchain' failed; build aborted
INFO: Elapsed time: 34.792s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (18 packages loaded, 3838 targets configured)

I assume the instructions should download and configure the cross-compilation tools automatically, right?

What went wrong?

That’s not really the kind of error I would have expected, it’s inconsistent: if you have those tools/arm_compiler/linaro-gcc72-armeabi/BUILD then there’s no reason it would not find the armv7 toolchain.

The only thing I can think of is some weird state. Please make sure you start from a clean environment, use our TensorFlow fork checked out at the documented branch, use the Bazel version documented upstream, and avoid working as root.

I’m quite sure the environment is clean, and it is your TensorFlow fork, checked out at origin/r1.13. I am using Bazel 0.21.0, which should be compatible with that TensorFlow version. If I haven’t missed something that causes the error, it is weird.

TensorFlow CI uses Bazel 0.19.2. Could you just make sure you clean any Bazel cache and do a configure from scratch, not as root?

Thanks a lot! Bazel 0.19.2 works.
Now, about the small hacking you mentioned above to build a binary with TensorFlow Lite support: could you give me some details?

Nope, because I have not kept notes :smiley:, but the build should progress well and then you should hit failures related to NEON stuff. Ping me again when that happens, and we’ll fix it publicly for others to try.

I managed to get this working the other day on the 0.5.1 DeepSpeech tag with the corresponding TensorFlow 1.13 Mozilla fork, using a workaround, and ran it on a Raspberry Pi 4. I’m not within reach of my Pi 3 at the moment, although I would expect it to work there, too. It was markedly faster with TensorFlow Lite compared to the .pb and .pbmm. I did it on the 0.5.1 tag mainly so I could use the prebuilt .tflite model at v0.5.1 (and looking forward to 0.6 given Google's On-Device Speech Recognizer and

See below for the workaround.

It was late then, and now I’m on a flaky connection and without sufficient power, so I haven’t gone back in and verified whether the .cc files even need these edits (I had just been working against those as they seemed the most likely culprit from what I could piece together originally). In any case, the BUILD file definitely needed to be touched up for this workaround: see the cc_library stanza with name = "tensor_utils" in the BUILD diff below.

I did this from a pristine Ubuntu 18.04 Docker container, not the GPU accelerated Dockerfile bundled with DeepSpeech (although I imagine that would work if you have an Nvidia GPU handy). By the way, here’s the thing in action on a Pi 4: . Like I say, it was late. Binding this for Python, readying the model for execution (instead of doing a full run top to bottom), and taking out artificial delays in the dialogue would make it run a bit faster.

Anyway, the diff.

(venv) root@d1d4b2c0eb07:/tensorflow# git diff
diff --git a/tensorflow/lite/kernels/internal/BUILD b/tensorflow/lite/kernels/internal/BUILD
index 4be3226938..7e2f66cc58 100644
--- a/tensorflow/lite/kernels/internal/BUILD
+++ b/tensorflow/lite/kernels/internal/BUILD
@@ -535,7 +535,7 @@ cc_library(
         "//conditions:default": [
-            ":portable_tensor_utils",
+            ":neon_tensor_utils",
diff --git a/tensorflow/lite/kernels/internal/optimized/tensor_utils_impl.h b/tensorflow/lite/kernels/internal/optimized/tensor_utils_impl.h
index 8f52ef131d..780ae1da6c 100644
--- a/tensorflow/lite/kernels/internal/optimized/tensor_utils_impl.h
+++ b/tensorflow/lite/kernels/internal/optimized/tensor_utils_impl.h
@@ -24,9 +24,9 @@ limitations under the License.
 #ifndef USE_NEON
-#if defined(__ARM_NEON__) || defined(__ARM_NEON)
+//#if defined(__ARM_NEON__) || defined(__ARM_NEON)
 #define USE_NEON
-#endif  //  defined(__ARM_NEON__) || defined(__ARM_NEON)
+//#endif  //  defined(__ARM_NEON__) || defined(__ARM_NEON)
 #endif  //  USE_NEON
 namespace tflite {
diff --git a/tensorflow/lite/kernels/internal/ b/tensorflow/lite/kernels/internal/
index 701e5a66aa..21f2723c3b 100644
--- a/tensorflow/lite/kernels/internal/
+++ b/tensorflow/lite/kernels/internal/
@@ -16,9 +16,9 @@ limitations under the License.
 #include "tensorflow/lite/kernels/internal/common.h"
 #ifndef USE_NEON
-#if defined(__ARM_NEON__) || defined(__ARM_NEON)
+//#if defined(__ARM_NEON__) || defined(__ARM_NEON)
 #define USE_NEON
-#endif  //  defined(__ARM_NEON__) || defined(__ARM_NEON)
+//#endif  //  defined(__ARM_NEON__) || defined(__ARM_NEON)
 #endif  //  USE_NEON
 #ifdef USE_NEON
diff --git a/tensorflow/lite/tools/make/Makefile b/tensorflow/lite/tools/make/Makefile
index 994f660dba..805b699d23 100644
--- a/tensorflow/lite/tools/make/Makefile
+++ b/tensorflow/lite/tools/make/Makefile
@@ -1,3 +1,5 @@
 # Find where we're running from, so we can store generated files here.
 ifeq ($(origin MAKEFILE_DIR), undefined)
        MAKEFILE_DIR := $(shell dirname $(realpath $(lastword $(MAKEFILE_LIST))))

You should be able to bazel build TF and make the libraries.

bazel build --config=monolithic --config=rpi3 --config=rpi3_opt --define=runtime=tflite --config=noaws --config=nogcp --config=nohdfs --config=nokafka --config=noignite -c opt --copt=-O3 --copt=-fvisibility=hidden // //native_client:generate_trie

In the DeepSpeech repo you may hit some errors and need to update the TFDIR path and the RASPBIAN path in native_client/. But you should be able to make the binaries.

make TARGET=rpi3 deepspeech

@dr0ptp4kt that’s really impressive especially considering the limited power of the RPi

Yes, that’s nice, but nothing new to us. Even with the TFLite runtime on the RPi4 we are still unable to get a real-time factor close to 1. Since TFLite on those platforms requires build system hacks, we have decided to hold off on using that feature. As documented here, it’s working.

What I’m seeing on the RPi4 with native_client/python/ for TFLite is inference at 3.503s for the 3.966s arctic_a0024.wav file.
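
For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of a timing harness. It is an assumption-laden illustration, not the DeepSpeech client: `run_inference` is a hypothetical stand-in for whatever function loads the audio and calls the model, and the helper names are made up for this example. With the figures above, the ratio would be 3.503 / 3.966, or roughly 0.88.

```python
import time
import wave


def wav_duration_seconds(path: str) -> float:
    """Audio length of a WAV file, from its frame count and sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def timed_call(fn, *args):
    """Run fn(*args) once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def report(path: str, run_inference) -> float:
    """Print elapsed time vs. audio length and return their ratio.

    run_inference is a hypothetical callable taking the WAV path.
    """
    audio_s = wav_duration_seconds(path)
    _, elapsed_s = timed_call(run_inference, path)
    ratio = elapsed_s / audio_s
    print(f"inference {elapsed_s:.3f}s for {audio_s:.3f}s of audio (ratio {ratio:.2f})")
    return ratio
```

Whether model loading is inside or outside `run_inference` changes the number considerably, so it is worth stating which one was timed when comparing results.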

I’d for one be most thankful if the build were available, but I’m just one person. I was thinking to maybe make a fresh Dockerfile to help people with reproducing the build. Would it be okay if I posted something like that to the main repo or would you recommend maintaining a fork instead?

Is the hope to drive the inference time down to subsecond or thereabouts (or near-instantaneous, as with the edgetpu)? In the context of a conversational agent, there are all kinds of UX hacks that can compensate for a few seconds of waiting, but I gather you were hoping to be able to provide nearer-realtime user feedback (which is of course more important in a number of other contexts).

That’s going to be a waste of your time because we won’t take it

This is not what I’m seeing. Can you document your context in more detail?

Is it possible there have been updates to the RPi4 firmware / bootloader that improve performance?

The hope is to get the transcription process running faster than the audio comes in.

Yes, but shipping that involves a non-trivial amount of work, and we have a lot of other things to take care of at the moment. So, we need a good incentive to enable it.

The other alternative is moving all platforms (except CUDA) to TFLite runtime. But again, that’s non-trivial.

You can help by doing what you did: experimenting, giving feedback and documenting it.

@dr0ptp4kt The other alternative is swapping TensorFlow runtime with TFLite runtime on ARMv7 and Aarch64 platforms. That involves less work, but it requires a few patches (yours is okay but it’s not the best way to handle it).

Okay, here’s the context. It would be cool to have TFLite as the basis for those architectures, I agree! I was a little confused on the build system (even though it’s rather well done - nice work!), but if you’d like I’d be happy to try posting some patches. I think this will work on the 0.6 branch as well and reckon the Linaro cross-compilation could be optimized for buster, but anyway here’s the 0.5.1 version.

Thanks. The problem is not making those patches (I have had TFLite builds working locally for months); it’s making the decision: balancing the cost of maintaining extra patches to the build system against the speedup gained.

Your work sounds like a great start, thanks! Does it use OpenCL or the CPU? Just asking to know how much margin for optimization there might be.