Prebuilt DeepSpeech binary for TensorFlow Lite model on Raspberry Pi 3?

Yes, that’s nice, but it’s nothing new to us. Even with the TFLite runtime on the RPi4 we are still unable to get a real-time factor close to 1. Since TFLite on those platforms requires build-system hacks, we have decided to hold off on enabling that feature. As documented here, it does work.

What I’m seeing on the RPi4 with native_client/python/client.py for TFLite is inference at 3.503s for the 3.966s arctic_a0024.wav file.
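That works out to a real-time factor of about 3.503 / 3.966 ≈ 0.88, i.e. the inference portion runs slightly faster than the audio itself.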

I for one would be most thankful if the build were available, but I’m just one person. I was thinking of making a fresh Dockerfile to help people reproduce the build. Would it be okay if I posted something like that to the main repo, or would you recommend maintaining a fork instead?

Is the hope to drive the inference time down to subsecond or thereabouts (or near-instantaneous, as with the Edge TPU)? In the context of a conversational agent, there are all kinds of UX hacks that can compensate for a few seconds of waiting, but I gather you were hoping to provide nearer-realtime user feedback (which is of course more important in a number of other contexts).

That’s going to be a waste of your time, because we won’t take it.

This is not what I’m seeing. Can you give more details about your context?

Is it possible there have been updates to the RPi4 firmware / bootloader that improve performance?

The hope is to get the transcription process to run faster than the audio comes in.

Yes, but shipping that involves a non-trivial amount of work, and we have a lot of other things to take care of at the moment. So, we need a good incentive to enable it.

The other alternative is moving all platforms (except CUDA) to TFLite runtime. But again, that’s non-trivial.

You can help by doing what you did: experimenting, giving feedback and documenting it.

@dr0ptp4kt The other alternative is swapping TensorFlow runtime with TFLite runtime on ARMv7 and Aarch64 platforms. That involves less work, but it requires a few patches (yours is okay but it’s not the best way to handle it).

Okay, here’s the context. It would be cool to have TFLite as the basis for those architectures, I agree! I was a little confused by the build system (even though it’s rather well done; nice work!), but if you’d like, I’d be happy to try posting some patches. I think this will work on the 0.6 branch as well, and I reckon the Linaro cross-compilation could be optimized for Buster, but anyway, here’s the 0.5.1 version.

Thanks, but the problem is not making those patches; I have had TFLite builds locally for months. It’s just a matter of making the decision: balancing the cost of maintaining extra patches to the build system against the speedup win.

Your work sounds like a great start, thanks! Does it use OpenCL or the CPU? Just asking to gauge how much room for optimization there might be.

You should not hope to use OpenCL on the RPi. I spent weeks testing its status last year, and while the driver was (and still is) under active development, our model was too complicated for it, and neither the maintainer nor I could find time to start working on the blocking items.

I would really like you to share more context, because I’m still not able to reproduce. This is on an RPi4, reinstalled just now, with the Ice Tower fan + heat spreader:

pi@raspberrypi:~/ds $ for f in audio/*.wav; do echo $f; mediainfo $f | grep Duration; done;
audio/2830-3980-0043.wav
Duration                                 : 1 s 975 ms
Duration                                 : 1 s 975 ms
audio/4507-16021-0012.wav
Duration                                 : 2 s 735 ms
Duration                                 : 2 s 735 ms
audio/8455-210777-0068.wav
Duration                                 : 2 s 590 ms
Duration                                 : 2 s 590 ms
pi@raspberrypi:~/ds $ ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.5-59-ga8a7af05
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=3.24553
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=2.38253
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=3.23032
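
(Dividing cpu_time_overall by the clip durations gives real-time factors of roughly 3.25/2.74 ≈ 1.19, 2.38/1.98 ≈ 1.21, and 3.23/2.59 ≈ 1.25, so each file still takes a bit longer to process than it does to play.)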

So it’s consistent with the previous builds I did. @dr0ptp4kt, can you give more context on what you’re doing? How do you build / measure?

Hi @lissyx! When I’m referring to inferencing, I’m talking about the inferencing-specific portion of the run with client.py.

I noticed that the GitHub link I posted looked like a simple fork link, but here’s the specific README.md that shows what I did:

https://github.com/dr0ptp4kt/DeepSpeech/blob/tflite-rpi-3and4-compat/rpi3and4/README.md

The LM, even from a warmed-up filesystem cache, is taking 1.28s to load on this 4 GB RAM Pi 4. So when that’s subtracted from the total run, it makes a significant percentage-wise difference. In an end-user application context, what I’d do is have the LM pre-loaded before the intake of voice data, so that the only thing the client has to do is the inferencing. Of course a 1.8 GB LM isn’t going to fit into RAM on a device with 1 GB of RAM, so there I think the only good option is to fiddle with the size (and therefore quality) of the LM, TRIE, and .tflite model files as appropriate to the use case.
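
To make the idea concrete, here’s a minimal sketch of the pre-loading (not the project’s client.py; the file paths are placeholders, and it assumes the deepspeech 0.5.x Python package with constants mirroring client.py defaults): the model and LM are constructed once at startup, so the per-request path is inference only.

# Sketch only; paths are placeholders, deepspeech 0.5.x Python API assumed.
import wave
import numpy as np
from deepspeech import Model

N_FEATURES, N_CONTEXT, BEAM_WIDTH = 26, 9, 500
LM_ALPHA, LM_BETA = 0.75, 1.85

# Done once at application startup, before any audio arrives.
ds = Model('output_graph.tflite', N_FEATURES, N_CONTEXT, 'alphabet.txt', BEAM_WIDTH)
ds.enableDecoderWithLM('alphabet.txt', 'lm.binary', 'trie', LM_ALPHA, LM_BETA)

def transcribe(wav_path):
    # Per-request work: read the audio and run inference against the warm model.
    with wave.open(wav_path, 'rb') as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)
        return ds.stt(audio, w.getframerate())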

I’m not telling you anything new here, but it’s also of course possible to offload error correction to the retrieval system. In my Wikipedia use case, in lower-RAM scenarios I might be content to forgo or dramatically shrink the LM and TRIE, increase the size of the .tflite for greater precision (because there would still be RAM available), and use some sort of optimized, forgiving topic-embedding / fuzzy-matching scheme in the retrieval system, effectively moving part of the problem to the later stage. It’s of course possible to move those improvements into the audio-recognition run with DeepSpeech itself, but in the context of this binary, it’s about managing RAM in stages so that the LM and TRIE don’t spill over and page to disk.
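
As a toy illustration of the fuzzy-matching idea (just a sketch; the index entries here are made up, and a real system would use something smarter than character-level similarity), the retrieval layer can still land on the intended entry even when the raw transcript has errors:

# Sketch: forgiving lookup of an imperfect ASR hypothesis against a retrieval index.
import difflib

index = [
    'experience proves this',
    'power is sufficient',
    'reports from the north',
]

hypothesis = 'experienced proof less'   # raw transcript, possibly with errors
matches = difflib.get_close_matches(hypothesis, index, n=1, cutoff=0.4)
print(matches)  # character-level similarity should still surface the intended entry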

Anyway, it looks like your run and my run are pretty close in terms of overall speed: processing takes close to the length of the clip (and the inferencing-specific part seems to take less time).

For your product roadmap, is the hope to be as fast as the incoming audio for realtime processing, or something of that nature? How much optimization do you want? I’m really interested in helping with that (through raw algorithms and smart hacks on the LM / TRIE / .tflite), or even with build-system work if you’re open to it, but I also know you need to manage the product roadmap, so I don’t want to be too imposing!

Keep up the great work! If it would work for you I’d be happy to discuss on video (or Freenode if you prefer).

I’m running without LM.

Ok, can you try with deepspeech C++ binary and the -t command line argument?

Those are mmap()'d, so it’s not really a big issue.

What do you mean?

Here’s what I’m seeing with -t. Funny I missed the flag earlier :stuck_out_tongue:

Using the LM:

$ ./deepspeech --model deepspeech-0.5.1-models/output_graph.tflite --alphabet deepspeech-0.5.1-models/alphabet.txt --lm deepspeech-0.5.1-models/lm.binary --trie deepspeech-0.5.1-models/trie --audio arctic_a0024.wav -t
TensorFlow: v1.13.1-13-g174b4760eb
DeepSpeech: v0.5.1-0-g4b29b78
it was my reports from the north which chiefly induced people to buy
cpu_time_overall=3.25151

Not using the LM:

$ ./deepspeech --model deepspeech-0.5.1-models/output_graph.tflite --alphabet deepspeech-0.5.1-models/alphabet.txt --audio arctic_a0024.wav -t
TensorFlow: v1.13.1-13-g174b4760eb
DeepSpeech: v0.5.1-0-g4b29b78
it was my reports from the northwhich chiefly induced people to buy
cpu_time_overall=6.95059

So part of the speed difference is definitely down to actual use of the LM.
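Put differently, for the 3.966 s clip the LM takes the total from about 6.95 s down to about 3.25 s, i.e. from a real-time factor of roughly 1.75 to roughly 0.82.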

I agree with you that mmap’ing the .tflite diminishes the negative effect of disk reads. As for the LM, it’s definitely faster when loaded into RAM. Are you sure it’s being consumed in an mmap’d fashion? I know it should be possible to mmap the read, of course, but that file seems to take some 40s on the initial run, which is longer than I would expect if it were doing filesystem segment seeks via mmap; maybe the 40s on the first read is just because the client fully consumes the file, whereas it could be made to only consume the pointer. I haven’t dug into that part of the code beyond a quick scan. Gotta run, but I’m interested to hear if you have tips.
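
If it helps, here’s a crude way I could check the difference (a sketch only, not how KenLM actually loads lm.binary): an mmap’d file barely moves resident memory until its pages are touched, whereas a full read pulls the whole file into RAM up front.

# Sketch: contrast lazy mmap paging with a full read of the LM file.
# The path is a placeholder; ru_maxrss is reported in KiB on Linux.
import mmap, resource

def rss_kb():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

path = 'lm.binary'
print('start:', rss_kb())

with open(path, 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    print('after mmap:', rss_kb())        # roughly unchanged
    _ = mm[:4096]                          # touching a page faults it in lazily
    print('after one page:', rss_kb())
    mm.close()

with open(path, 'rb') as f:
    data = f.read()                        # full read: the whole file lands in RAM
print('after full read:', rss_kb())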

For the product roadmap, I mainly just wanted to make sure that any patches I post would be valuable to you and to the general DeepSpeech user base. I know it’s an open source project and I’m free to fork, but if there are problems to solve that are mutually beneficial, I’d rather work on those. I reckon the last thing you need is patches that aren’t aligned with where you’re taking the software. Specifically, I was wondering how much optimization you want in this RPi4 context; if it would be helpful, I’d be inclined to post patches that get optimization to the level you’re hoping for. As for the build system, I’d also be happy to help with build scripts and that sort of thing (e.g., Bazel work, cutting differently sized versions of the models, etc.). I’m not sure whether you’d need me to get shell access and do the requisite paperwork for that, or whether that’s off limits or just not helpful; I can appreciate just how hard build and deploy pipelines are. I realize TaskCluster can run on fairly arbitrary systems, but I don’t have a multi-GPU machine, so much of the full build pipeline, and even assumptions about things as simple as keystores, tends to break down on my local dev rig.

I’m unsure exactly what you are suggesting here. You reported much faster inference than what we can achieve, so I’m trying to understand. Getting TFLite to run is not a problem, I’ve been experimenting with that for months now, so I know how to do it.

Maybe, but that’s not really what we are concerned about for now.

Could you please reproduce that with current master and using our audio files?

Also, could you please document your base system? Raspbian? What’s your PSU?

I’ll reproduce with the current master and share. I may be a bit occupied the next several days, just a heads up.

It’s stock Raspbian for the Raspberry Pi 4, using the Raspberry Pi official USB-C power adapter. I have a metal heatsink on the CPU and modem (heatsink for modem doesn’t fit on GPU or I’d put it there!) but no other thermal stuff going, no overclocking. Pretty normal setup.

@lissyx I found some time tonight to build this against master using Bazel 0.24.1. I used the multistrap conf file for Stretch instead of Buster, just to hold that variable constant. FWIW, I had actually produced a Buster build ~2 weeks ago for v0.5.1 and saw similar performance between the Stretch and Buster builds at that time.

I just re-used the .tflite and lm.binary from the v0.5.1 model archive, as you might infer from the directory names in the second run below. From my Mac I generated a new trie file for v4 instead of v3, then SCP’d it over to the Pi for use in the second run below (the faster one that uses the LM and TRIE).

Anyway, here’s what I’m seeing.

Without LM and TRIE. Quite similar in speed compared to the results you pasted in.

pi@pi4:~/ds60 $ ./deepspeech --model output_graph.tflite --alphabet alphabet.txt --audio audio/ -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.6-0-gccf1b2e
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=3.16360
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=3.08172
> audio//arctic_a0024.wav
it was my reports from the northwhich chiefly induced people to buy
cpu_time_overall=4.89014
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=2.30461

With LM and TRIE. Faster. The nice news here is that for arctic_a0024.wav it appears to be even faster than what I got with the v0.5.1 tflite build.

~/ds60 $ ./deepspeech --model output_graph.tflite --alphabet alphabet.txt --lm ~/ds/deepspeech-0.5.1-models/lm.binary --trie trie --audio audio/ -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.6-0-gccf1b2e
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=2.10253
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=2.07445
> audio//arctic_a0024.wav
it was my reports from the north which chiefly induced people to buy
cpu_time_overall=3.17514
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=1.55579

Well, if you run on an RPi4, you have Buster. It’d be more correct to use Buster when building, even though it should not make any difference.

All in all, we have the same setup, at least.

I’m starting to wonder whether we have regressed in the way we measure time.

Ok, adding the LM, I’m getting similar results:

pi@raspberrypi:~/ds $ time ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.5-59-ga8a7af05
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=3.24919
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=2.37936
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=3.20311

real    0m8.877s
user    0m8.781s
sys     0m0.091s
pi@raspberrypi:~/ds $ time ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ --lm models/lm.binary --trie models/trie -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.5-59-ga8a7af05
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=2.14202
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=1.59947
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=2.15120

real    0m6.810s
user    0m5.890s
sys     0m0.920s
pi@raspberrypi:~/ds $ 

Okay, I know what’s happening. When we don’t load an LM, we don’t set an external scorer, and thus this is not executed: https://github.com/mozilla/DeepSpeech/blob/ccf1b2e73ed161525a289ecf8d4e7beac9adad88/native_client/ctcdecode/ctc_beam_search_decoder.cpp#L39-L44

The DS_SpeechToText* implementation will, under the hood, rely on the Streaming API and this means StreamingState::processBatch() gets computed a few times. That will call decoder_state_.next(): https://github.com/mozilla/DeepSpeech/blob/ccf1b2e73ed161525a289ecf8d4e7beac9adad88/native_client/deepspeech.cc#L253-L255

Obviously, with the LM and the trie, the beam search is faster.

This is confirmed when checking execution time around .next().

Here, with the LM:

pi@raspberrypi:~/ds $ time ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ --lm models/lm.binary --trie models/trie --extended -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.6-5-g5845505
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
ds_createstream_time=0.00002
decoder_state_time=0.00501
decoder_state_time=0.00368
decoder_state_time=0.02743                   
decoder_state_time=0.04064     
decoder_state_time=0.02312                                                   
decoder_state_time=0.03033
decoder_state_time=0.04554
decoder_state_time=0.01943
decoder_state_time=0.01434
ds_create_time=1.59941 ds_finish_time=2.11518
why should one halt on the way
cpu_time_overall=2.11524 cpu_time_decoding=0.08359 cpu_time_decodeall=0.08360
> audio//2830-3980-0043.wav
ds_createstream_time=0.00001
decoder_state_time=0.00450
decoder_state_time=0.00636
decoder_state_time=0.01696
decoder_state_time=0.03406
decoder_state_time=0.02439
decoder_state_time=0.00506
decoder_state_time=0.00079
ds_create_time=1.09895 ds_finish_time=1.58255
experienced proof less
cpu_time_overall=1.58259 cpu_time_decoding=0.07956 cpu_time_decodeall=0.07957
> audio//8455-210777-0068.wav
ds_createstream_time=0.00001
decoder_state_time=0.00414
decoder_state_time=0.00451
decoder_state_time=0.02432
decoder_state_time=0.05007
decoder_state_time=0.04753
decoder_state_time=0.03541
decoder_state_time=0.04675
decoder_state_time=0.00631
decoder_state_time=0.00084
ds_create_time=1.62787 ds_finish_time=2.12236
your power is sufficient i said
cpu_time_overall=2.12242 cpu_time_decoding=0.08898 cpu_time_decodeall=0.08899

real    0m6.732s
user    0m5.850s
sys     0m0.882s

And without the LM:

pi@raspberrypi:~/ds $ time ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ --extended -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.6-5-g5845505
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
ds_createstream_time=0.00008
decoder_state_time=0.09759
decoder_state_time=0.11414
decoder_state_time=0.13162
decoder_state_time=0.15993
decoder_state_time=0.16111
decoder_state_time=0.14577
decoder_state_time=0.16027
decoder_state_time=0.16672
decoder_state_time=0.09356
ds_create_time=2.42510 ds_finish_time=3.16191
why should one halt on the way
cpu_time_overall=3.16199 cpu_time_decoding=0.07662 cpu_time_decodeall=0.07662
> audio//2830-3980-0043.wav
ds_createstream_time=0.00001
decoder_state_time=0.09842
decoder_state_time=0.11592
decoder_state_time=0.13760
decoder_state_time=0.16061
decoder_state_time=0.15751
decoder_state_time=0.14751
decoder_state_time=0.02784
ds_create_time=1.68183 ds_finish_time=2.32753
experienced proof less
cpu_time_overall=2.32759 cpu_time_decoding=0.07175 cpu_time_decodeall=0.07176
> audio//8455-210777-0068.wav
ds_createstream_time=0.00001
decoder_state_time=0.10290
decoder_state_time=0.11246
decoder_state_time=0.13637
decoder_state_time=0.15481
decoder_state_time=0.16675
decoder_state_time=0.18728
decoder_state_time=0.19037
decoder_state_time=0.18308
decoder_state_time=0.01142
ds_create_time=2.46809 ds_finish_time=3.14513
your power is sufficient i said
cpu_time_overall=3.14518 cpu_time_decoding=0.08011 cpu_time_decodeall=0.08012

real    0m8.674s
user    0m8.594s
sys     0m0.081s

This accounts for the difference in execution time.
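
A quick sanity check from the logs above, for 4507-16021-0012.wav: summing the decoder_state_time values gives about 1.23 s without the LM versus about 0.21 s with it, a gap of roughly 1.0 s, which closely matches the difference in cpu_time_overall (3.16 s vs 2.12 s).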

@dr0ptp4kt Thanks for sharing your experiments. I’ve gotten into the bad habit of testing without the LM, and when I first tested on the RPi4 I did not pay enough attention to the fact that the LM would have such an impact. Without your feedback, I would not have dug deeper. I think we’re going to switch to the TFLite runtime for the v0.6 ARMv7 builds for RPi3/RPi4, and thus tell people that they should be able to get faster-than-realtime performance on the RPi4.

As far as I could test on the Aarch64 boards we have (LePotato, S905X, https://libre.computer/products/boards/aml-s905x-cc/), the situation still holds: the SoC is not powerful enough to reach that kind of performance.

I did investigate that, and found that we could tune the way the LM is loaded. Current master has a PR that improves things. On the RPi4, the latency improves nicely. On Android devices, the difference isn’t even visible, as far as I could test :-).