Yes, that’s nice, but nothing new to us. Even with the TFLite runtime on the RPi4 we are still unable to get a real-time factor close to 1. And since TFLite on those platforms requires build-system hacks, we decided to hold off on shipping that feature. As documented here, it works.
What I’m seeing on the RPi4 with native_client/python/client.py for TFLite is inference at 3.503s for the 3.966s arctic_a0024.wav file.
I for one would be most thankful if the build were available, but I’m just one person. I was thinking of maybe making a fresh Dockerfile to help people reproduce the build. Would it be okay if I posted something like that to the main repo, or would you recommend maintaining a fork instead?
Is the hope to drive the inference time down to subsecond or thereabouts (or instantaneous, as with the Edge TPU)? In the context of a conversational agent, there are all kinds of UX hacks that can compensate for a few seconds of waiting, but I gather you were hoping to provide nearer-realtime user feedback (which is of course more important in a number of other contexts).
That’s going to be a waste of your time, because we won’t take it.
This is not what I’m seeing; can you document your context in more detail?
Is it possible there have been RPi4 firmware / bootloader updates that improve performance?
The hope is to get the transcription process running faster than the audio comes in.
Yes, but shipping that involves a non-trivial amount of work, and we have a lot of other things to take care of at the moment. So we need a good incentive to enable it.
The other alternative is moving all platforms (except CUDA) to the TFLite runtime. But again, that’s non-trivial.
You can help by doing what you did: experimenting, giving feedback and documenting it.
@dr0ptp4kt The other alternative is swapping the TensorFlow runtime for the TFLite runtime on ARMv7 and Aarch64 platforms. That involves less work, but it requires a few patches (yours is okay, but it’s not the best way to handle it).
Okay, here’s the context. It would be cool to have TFLite as the basis for those architectures, I agree! I was a little confused by the build system (even though it’s rather well done, nice work!), but if you’d like, I’d be happy to try posting some patches. I think this will work on the 0.6 branch as well, and I reckon the Linaro cross-compilation could be optimized for Buster, but anyway, here’s the 0.5.1 version.
Thanks. The problem is not making those patches; I’ve had TFLite builds locally for months. It’s just a matter of making the decision: a balance between maintaining extra patches to the build system vs. the speedup win.
Your work sounds like a great start, thanks! Does it use OpenCL or the CPU? Just asking to know how much margin for optimization there might be.
You should not hope to use OpenCL on the RPi. I worked on that for weeks last year to test the status, and while the driver was (and still is) in active development, our model was too complicated for it, and neither the maintainer nor I could find time to start working on the blocking items.
I would really like you to share more context, because I’m still not able to reproduce. This is on a RPi4, reinstalled just now, with the ICE Tower fan + heat spreader:
pi@raspberrypi:~/ds $ for f in audio/*.wav; do echo $f; mediainfo $f | grep Duration; done;
audio/2830-3980-0043.wav
Duration : 1 s 975 ms
Duration : 1 s 975 ms
audio/4507-16021-0012.wav
Duration : 2 s 735 ms
Duration : 2 s 735 ms
audio/8455-210777-0068.wav
Duration : 2 s 590 ms
Duration : 2 s 590 ms
pi@raspberrypi:~/ds $ ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.5-59-ga8a7af05
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=3.24553
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=2.38253
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=3.23032
So it’s consistent with the previous builds I did. @dr0ptp4kt, can you give more context on what you’re doing? How do you build / measure?
Hi @lissyx! When I refer to inferencing, I’m talking about the inference-specific portion of the run with client.py.
I noticed that the GitHub link I posted looked like a simple fork link, but here’s the specific README.md that shows what I did:
The LM, even from a warmed-up filesystem cache, takes 1.28s to load on this 4 GB RAM Pi 4. When that’s subtracted from the total run, it makes a significant percentage-wise difference. In an end-user application context, what I’d do is have the LM pre-injected before the intake of voice data, so that the only thing the client has to do is the inferencing. Of course, a 1.8 GB LM isn’t going to fit into RAM on a device with 1 GB of RAM; there, I think the only good option is to fiddle with the size (and therefore quality) of the LM, TRIE, and .tflite model files appropriate to the use case.
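To make that concrete, here’s a rough sketch of what I have in mind, assuming the 0.5.x C API from native_client/deepspeech.h (I’m going from memory, so the exact signatures may differ slightly by version):

// Hypothetical sketch: pay the model + LM + trie load cost once at startup,
// so the hot path does inference only (assumed 0.5.x C API; check deepspeech.h).
#include "deepspeech.h"

static ModelState* ctx = nullptr;

void warm_up() {
  // Done before any voice intake; this is where the multi-second load goes.
  DS_CreateModel("output_graph.tflite", 26 /* n_cep */, 9 /* n_context */,
                 "alphabet.txt", 500 /* beam width */, &ctx);
  DS_EnableDecoderWithLM(ctx, "alphabet.txt", "lm.binary", "trie",
                         0.75f /* lm_alpha */, 1.85f /* lm_beta */);
}

char* transcribe(const short* samples, unsigned int count) {
  // Hot path: no model or LM disk I/O left, just inference + decoding.
  return DS_SpeechToText(ctx, samples, count, 16000 /* sample rate, Hz */);
}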
I’m not telling you anything new here, but it’s also possible, of course, to offload error correction to the retrieval system. In my Wikipedia use case, for lower-RAM scenarios I might be content to forgo or dramatically shrink the LM and TRIE, increase the size of the .tflite model for greater precision (because there would still be RAM available), and use some sort of optimized, forgiving topic-embedding / fuzzy-matching scheme in the retrieval system, effectively moving part of the problem to the later stage. It’s of course possible to move those improvements into the audio-detection run with DeepSpeech itself, but in the context of this binary, it’s about managing the RAM in stages so that the LM and TRIE don’t spill over and page to disk.
Anyway, it looks like your run and my run are pretty close in terms of overall speed: processing takes close to the same time as the length of the clip (and the inference-specific part seems to take less time).
For your product roadmap, is the hope to be as fast as the incoming audio, for realtime processing or something of that nature? How much optimization do you want? I’m really interested in helping with that (through raw algorithms and smart hacks on the LM / TRIE / .tflite) or even with build-system work if you’re open to it. But I also know you need to manage the product roadmap, so I don’t want to be too imposing!
Keep up the great work! If it would work for you I’d be happy to discuss on video (or Freenode if you prefer).
I’m running without LM.
Ok, can you try with the deepspeech C++ binary and the -t command-line argument?
It’s mmap()'d, so it’s not really a big issue.
What do you mean?
Here’s what I’m seeing with -t. Funny that I missed the flag earlier.
Using the LM:
$ ./deepspeech --model deepspeech-0.5.1-models/output_graph.tflite --alphabet deepspeech-0.5.1-models/alphabet.txt --lm deepspeech-0.5.1-models/lm.binary --trie deepspeech-0.5.1-models/trie --audio arctic_a0024.wav -t
TensorFlow: v1.13.1-13-g174b4760eb
DeepSpeech: v0.5.1-0-g4b29b78
it was my reports from the north which chiefly induced people to buy
cpu_time_overall=3.25151
Not using the LM:
$ ./deepspeech --model deepspeech-0.5.1-models/output_graph.tflite --alphabet deepspeech-0.5.1-models/alphabet.txt --audio arctic_a0024.wav -t
TensorFlow: v1.13.1-13-g174b4760eb
DeepSpeech: v0.5.1-0-g4b29b78
it was my reports from the northwhich chiefly induced people to buy
cpu_time_overall=6.95059
So part of the speed difference definitely comes from actually using the LM.
I agree with you that mmap’ing the .tflite diminishes the negative effect of disk reads. As for the LM, it’s definitely faster when injected into RAM. Are you sure it’s being consumed in an mmap’d fashion? I know it should be possible to mmap the read, of course, but that thing seems to take some 40s on the initial run; that’s longer than I would expect if it were doing filesystem segment seeks in an mmap fashion. Maybe the 40s on the first read is just because the client is fully consuming the file, whereas it could be made to only consume the pointer…I haven’t dug into that part of the code beyond a quick scan. Gotta run, but I’m interested to hear if you have tips.
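To illustrate the distinction I’m drawing, here’s a minimal, hypothetical sketch (not DeepSpeech’s actual loader) of a full read versus an mmap; with mmap, the “load” returns almost immediately and the disk I/O happens lazily as pages are touched:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  if (argc < 2) { std::fprintf(stderr, "usage: %s <lm.binary>\n", argv[0]); return 1; }
  int fd = open(argv[1], O_RDONLY);
  struct stat st;
  fstat(fd, &st);

  // Eager: read() pulls every byte off disk before returning; on a 1.8 GB
  // LM this is where a cold-cache first run would spend its time.
  // std::vector<char> buf(st.st_size);
  // read(fd, buf.data(), st.st_size);

  // Lazy: mmap() just sets up the mapping and returns; pages are faulted
  // in from disk only when the consumer actually dereferences them.
  void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  if (base == MAP_FAILED) { std::perror("mmap"); return 1; }
  // ... hand `base` to the LM library; only the regions visited during
  // decoding ever get paged in.
  munmap(base, st.st_size);
  close(fd);
  return 0;
}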
For the product roadmap, I mainly just wanted to ensure that any patches I post would be valuable to you and to the general DeepSpeech user base. I know it’s an open source project and I’m free to fork, but if there are problems to solve that are mutually beneficial, I’d rather work on those. I reckon the last thing you need is patches that aren’t aligned with where you’re taking this software. Specifically, I was wondering how much optimization you want in this RPi4 context; if it would be helpful, I’d be inclined to post patches to address optimization to whatever level you’re hoping for. As for the build system, I’d also be happy to help with build scripts and that sort of thing (e.g., Bazel stuff, cutting differently sized versions of models, etc.). I’m not sure whether you’d need me to get shell access and do the requisite paperwork for that, or whether that’s off limits or just not helpful; I can appreciate just how hard build and deploy pipelines are. I realize TaskCluster sort of runs on arbitrary systems, but it’s also the case that I don’t have a multi-GPU machine, so much of the full build pipeline, and even assumptions about things as simple as keystores, sort of breaks down on my local dev rig.
I’m unsure exactly what you’re suggesting here. You reported much faster inference than what we can achieve, so I’m trying to understand. Getting TFLite to run is not a problem; I’ve been experimenting with it for months now, so I know how to do it.
Maybe, but that’s not really what we are concerned about for now.
Could you please reproduce that with current master and using our audio files?
Also, could you please document your base system? Raspbian? What’s your PSU?
I’ll reproduce with current master and share. Just a heads-up: I may be a bit occupied over the next several days.
It’s stock Raspbian for the Raspberry Pi 4, using the official Raspberry Pi USB-C power adapter. I have a metal heatsink on the CPU and the modem (the heatsink for the modem doesn’t fit on the GPU, or I’d put it there!), but no other thermal measures and no overclocking. A pretty normal setup.
@lissyx I found some time tonight to build this against master using Bazel 0.24.1. I used the multistrap conf file for Stretch instead of Buster, just to hold that variable constant. FWIW, I had actually produced a Buster build ~2 weeks ago for v0.5.1 and saw similar performance between the Stretch and Buster builds at that time.
I just re-used the .tflite and lm.binary from the v0.5.1 model archive, as you might infer from the directory names in the second run below. On my Mac I generated a new trie file for v4 instead of v3, then SCP’d it over to the Pi for use in the second run below (the faster one, which uses the LM and TRIE).
Anyway, here’s what I’m seeing.
Without LM and TRIE. Quite similar in speed to the results you pasted:
pi@pi4:~/ds60 $ ./deepspeech --model output_graph.tflite --alphabet alphabet.txt --audio audio/ -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.6-0-gccf1b2e
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=3.16360
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=3.08172
> audio//arctic_a0024.wav
it was my reports from the northwhich chiefly induced people to buy
cpu_time_overall=4.89014
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=2.30461
With LM and TRIE. Faster. The nice news here is that for arctic_a0024.wav it appears to be even faster than what I got with the v0.5.1 TFLite build:
~/ds60 $ ./deepspeech --model output_graph.tflite --alphabet alphabet.txt --lm ~/ds/deepspeech-0.5.1-models/lm.binary --trie trie --audio audio/ -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.6-0-gccf1b2e
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=2.10253
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=2.07445
> audio//arctic_a0024.wav
it was my reports from the north which chiefly induced people to buy
cpu_time_overall=3.17514
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=1.55579
Well, if you run on a RPi4, you have Buster. It would be more correct to use Buster when building, even though it should not make any difference.
All in all, we have the same setup, at least.
I’m starting to wonder whether we have regressed the way we measure time.
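One way to sanity-check that (a generic sketch, not our actual instrumentation): compare CPU time, which std::clock() accumulates across all threads, against wall-clock time. On a multi-threaded run the two can diverge a lot, so cpu_time_overall and the `real` figure from time(1) tell different stories.

#include <chrono>
#include <cstdio>
#include <ctime>

int main() {
  std::clock_t c0 = std::clock();              // CPU time, summed over threads
  auto w0 = std::chrono::steady_clock::now();  // wall-clock time

  volatile double x = 0;                       // some work to measure
  for (long i = 0; i < 100000000L; ++i) x += i * 1e-9;

  std::clock_t c1 = std::clock();
  auto w1 = std::chrono::steady_clock::now();
  std::printf("cpu=%.3fs wall=%.3fs\n",
              double(c1 - c0) / CLOCKS_PER_SEC,
              std::chrono::duration<double>(w1 - w0).count());
  return 0;
}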
Ok, adding the LM, I’m getting similar results:
pi@raspberrypi:~/ds $ time ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.5-59-ga8a7af05
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=3.24919
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=2.37936
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=3.20311

real	0m8.877s
user	0m8.781s
sys	0m0.091s

pi@raspberrypi:~/ds $ time ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ --lm models/lm.binary --trie models/trie -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.5-59-ga8a7af05
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=2.14202
> audio//2830-3980-0043.wav
experienced proof less
cpu_time_overall=1.59947
> audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=2.15120

real	0m6.810s
user	0m5.890s
sys	0m0.920s
pi@raspberrypi:~/ds $
Okay, I know what’s happening. When we don’t load an LM, we don’t set an external scorer, and thus this is not executed: https://github.com/mozilla/DeepSpeech/blob/ccf1b2e73ed161525a289ecf8d4e7beac9adad88/native_client/ctcdecode/ctc_beam_search_decoder.cpp#L39-L44
The DS_SpeechToText* implementation will, under the hood, rely on the Streaming API, which means StreamingState::processBatch() gets computed a few times; each of those calls runs a step of the CTC beam search decoder. Obviously, with the LM and the trie, the beam search is faster. This is confirmed by checking execution time around those decoder calls (the decoder_state_time values in the runs below).
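Schematically, the effect is something like this (a simplified sketch, not the verbatim decoder source; Scorer, is_in_vocab and the exact weighting are stand-ins for the real KenLM-backed scorer and trie):

#include <cmath>
#include <string>
#include <vector>

// Stand-in for the KenLM scorer + trie (hypothetical, trivial bodies).
struct Scorer {
  double get_log_cond_prob(const std::vector<std::string>&) const { return -1.0; }
  bool is_in_vocab(const std::string& w) const { return !w.empty(); }
};

// Without an external scorer (ext_scorer == nullptr) every prefix survives
// on its acoustic score alone; with LM + trie, out-of-vocabulary prefixes
// get pruned and LM-weighted scores reorder the beam, so fewer candidates
// are carried from one time step to the next, and each step is cheaper.
double score_prefix(const std::vector<std::string>& words,
                    double acoustic_logp,
                    const Scorer* ext_scorer,
                    double alpha, double beta) {
  double score = acoustic_logp;
  if (ext_scorer != nullptr) {
    if (!words.empty() && !ext_scorer->is_in_vocab(words.back()))
      return -INFINITY;  // pruned: this prefix is never expanded again
    score += alpha * ext_scorer->get_log_cond_prob(words)
           + beta * static_cast<double>(words.size());
  }
  return score;
}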
Here, with LM:
pi@raspberrypi:~/ds $ time ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ --lm models/lm.binary --trie models/trie --extended -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.6-5-g5845505
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
ds_createstream_time=0.00002
decoder_state_time=0.00501
decoder_state_time=0.00368
decoder_state_time=0.02743
decoder_state_time=0.04064
decoder_state_time=0.02312
decoder_state_time=0.03033
decoder_state_time=0.04554
decoder_state_time=0.01943
decoder_state_time=0.01434
ds_create_time=1.59941
ds_finish_time=2.11518
why should one halt on the way
cpu_time_overall=2.11524
cpu_time_decoding=0.08359
cpu_time_decodeall=0.08360
> audio//2830-3980-0043.wav
ds_createstream_time=0.00001
decoder_state_time=0.00450
decoder_state_time=0.00636
decoder_state_time=0.01696
decoder_state_time=0.03406
decoder_state_time=0.02439
decoder_state_time=0.00506
decoder_state_time=0.00079
ds_create_time=1.09895
ds_finish_time=1.58255
experienced proof less
cpu_time_overall=1.58259
cpu_time_decoding=0.07956
cpu_time_decodeall=0.07957
> audio//8455-210777-0068.wav
ds_createstream_time=0.00001
decoder_state_time=0.00414
decoder_state_time=0.00451
decoder_state_time=0.02432
decoder_state_time=0.05007
decoder_state_time=0.04753
decoder_state_time=0.03541
decoder_state_time=0.04675
decoder_state_time=0.00631
decoder_state_time=0.00084
ds_create_time=1.62787
ds_finish_time=2.12236
your power is sufficient i said
cpu_time_overall=2.12242
cpu_time_decoding=0.08898
cpu_time_decodeall=0.08899

real	0m6.732s
user	0m5.850s
sys	0m0.882s
And without LM:
pi@raspberrypi:~/ds $ time ./deepspeech --model models/output_graph.tflite --alphabet models/alphabet.txt --audio audio/ --extended -t
TensorFlow: v1.14.0-14-g1aad02a78e
DeepSpeech: v0.6.0-alpha.6-5-g5845505
INFO: Initialized TensorFlow Lite runtime.
Running on directory audio/
> audio//4507-16021-0012.wav
ds_createstream_time=0.00008
decoder_state_time=0.09759
decoder_state_time=0.11414
decoder_state_time=0.13162
decoder_state_time=0.15993
decoder_state_time=0.16111
decoder_state_time=0.14577
decoder_state_time=0.16027
decoder_state_time=0.16672
decoder_state_time=0.09356
ds_create_time=2.42510
ds_finish_time=3.16191
why should one halt on the way
cpu_time_overall=3.16199
cpu_time_decoding=0.07662
cpu_time_decodeall=0.07662
> audio//2830-3980-0043.wav
ds_createstream_time=0.00001
decoder_state_time=0.09842
decoder_state_time=0.11592
decoder_state_time=0.13760
decoder_state_time=0.16061
decoder_state_time=0.15751
decoder_state_time=0.14751
decoder_state_time=0.02784
ds_create_time=1.68183
ds_finish_time=2.32753
experienced proof less
cpu_time_overall=2.32759
cpu_time_decoding=0.07175
cpu_time_decodeall=0.07176
> audio//8455-210777-0068.wav
ds_createstream_time=0.00001
decoder_state_time=0.10290
decoder_state_time=0.11246
decoder_state_time=0.13637
decoder_state_time=0.15481
decoder_state_time=0.16675
decoder_state_time=0.18728
decoder_state_time=0.19037
decoder_state_time=0.18308
decoder_state_time=0.01142
ds_create_time=2.46809
ds_finish_time=3.14513
your power is sufficient i said
cpu_time_overall=3.14518
cpu_time_decoding=0.08011
cpu_time_decodeall=0.08012

real	0m8.674s
user	0m8.594s
sys	0m0.081s
This accounts for the difference in execution time.
@dr0ptp4kt Thanks for sharing your experiments. I’ve gotten into the bad habit of testing without the LM, and when I first tested on the RPi4 I didn’t pay enough attention to notice that the LM would have such an impact. Without your feedback, I would not have dug deeper. I think we’re going to switch to the TFLite runtime for v0.6 on the ARMv7 builds for RPi3/RPi4, and thus instruct people that they should be able to get faster-than-realtime performance on the RPi4.
As far as I could test on the Aarch64 boards we have (LePotato, S905X, https://libre.computer/products/boards/aml-s905x-cc/), the situation still holds: the SoC is not powerful enough for decent performance.
I did investigate that, and found that we could tune the way the LM is loaded. Current master has a PR that improves things. On the RPi4, the latency improves nicely. On Android devices, as far as I could test, it’s not even noticeable :-).
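For reference, I believe the knob involved is KenLM’s Config::load_method (a sketch; the actual PR may differ in the details):

#include "lm/model.hh"   // KenLM
#include "util/mmap.hh"  // util::LoadMethod

int main() {
  lm::ngram::Config config;
  // The default (POPULATE_OR_READ) pulls the whole file into RAM up front;
  // LAZY mmap()s it and faults pages in on demand, so startup latency drops
  // and the cost moves to the first queries instead.
  config.load_method = util::LAZY;
  lm::ngram::Model model("lm.binary", config);
  (void)model;
  return 0;
}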