Video and benchmarking results

I stumbled upon the DeepSpeech project a few weeks ago while searching for a suitable ASR engine for my video and article about speech recognition on embedded devices.
I was really impressed with the performance and speed of the 0.6 release! So I decided to write and publish an article about benchmarking the 0.6 DeepSpeech engine and building a transcription demo for the Raspberry Pi 4/Jetson Nano.

Here are the links to the article and the video. I hope they bring more publicity to such an outstanding project!

(I already corrected the typo in the description :slight_smile: )

While running tests on the Nvidia Jetson Nano I used a pre-release wheel for the arm64 architecture with tflite model inference enabled by default. The performance was slightly worse than on the Raspberry Pi 4. I am interested in trying to run it on the Jetson Nano with GPU acceleration - are there plans to release arm64 builds with GPU support? If not, I will try cross-compiling it myself; are there any caveats I should know about beforehand?
Thank you for such an amazing project!

It would be great to have more figures than just “slightly worse”; maybe there’s some trivial, actionable item here. But if I read correctly, it’s a Cortex-A57, so close to the RPi3. It’d be interesting to know more. Re-training a simpler model (with n_hidden lower than 2048) might help a lot here, if you manage to keep good accuracy.

Please also confirm whether you tested with the language model; this can make a difference: not having it can slow things down noticeably.
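For reference, timing that on 0.6.1 could look roughly like the following (a hedged sketch; model and audio paths are placeholders, and the CLI also prints its own inference timing):

    # time a single-file inference with the 0.6.x CLI, language model enabled
    time deepspeech --model deepspeech-0.6.1-models/output_graph.tflite \
                    --lm deepspeech-0.6.1-models/lm.binary \
                    --trie deepspeech-0.6.1-models/trie \
                    --audio test.wav
    # re-run without --lm/--trie to see how much the language model changes latency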

No, because this is non-trivial work, especially getting it covered in CI, and we are trying to move away from GPUs.

It should work properly; I know @elpimous_robot did it successfully. Basically you just follow the cross-compilation docs we have, but you might need some specific tuning to ensure your sysroot directory includes the CUDA bits. And obviously you need to adapt the build to use CUDA, i.e. --config=cuda.
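Very roughly, and treating the exact flags as an assumption rather than a recipe (the real ones are in the cross-compilation docs and depend on your TensorFlow checkout), the build would look something like:

    # cross-build libdeepspeech.so with CUDA enabled; the sysroot must contain the CUDA libraries
    bazel build --config=monolithic --config=cuda //native_client:libdeepspeech.so
    # then rebuild the native client / language bindings against that libdeepspeech.so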

Given we don’t have the hardware, it’ll be hard to help more.


There is a comparison table in the article :slight_smile: I used the language model for the tests on all platforms. The average over 10 runs (first run discarded) is 1.6 seconds for the Raspberry Pi 4 (1 GB) and 2.3 seconds for the Jetson Nano.
Yes, alas, I don’t have the Jetson Nano at hand right now either. I had to leave China for the time being because of the coronavirus outbreak :frowning:
I’m curious, why are you moving away from GPUs now?

For local inference, GPUs are far from the most flexible solution: we are limited to CUDA, it’s not an efficient use of the power we have, and it adds complex dependencies. Basically, we are considering dropping the plain TensorFlow runtime for local inference and moving everything to TFLite there. Full-blown GPU / TensorFlow would still make sense in some other use cases, but the benefits of switching all local inference to TFLite bring me a lot of joy.

Well, sorry, but you never mentioned that, and it’s shared as an image, so it’s not easily accessible / searchable.

How much audio is that? The difference is not that huge; it seems the Jetson Nano is more efficient than the RPi3.

@dmitrywat
Hi dmitrywat, I am also working with DeepSpeech on a Jetson Xavier, and I am interested in trying TensorFlow Lite.
Could you please kindly share the file
https://community-tc.services.mozilla.com/api/queue/v1/task/KZMAnYo2Qy2-icrTp5Ldqw/runs/0/artifacts/public/deepspeech-0.6.1-cp37-cp37m-linux_aarch64.whl
as the link seems to be invalid.
Thank you.

Please just use the latest alpha 1 release on GitHub; the file linked below will work.

https://github.com/mozilla/DeepSpeech/releases/download/v0.7.0-alpha.1/deepspeech-0.7.0a1-cp37-cp37m-linux_aarch64.whl
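Assuming Python 3.7 on an aarch64 board, installing it should be just:

    # fetch and install the 0.7.0-alpha.1 ARM64 wheel (TFLite runtime)
    wget https://github.com/mozilla/DeepSpeech/releases/download/v0.7.0-alpha.1/deepspeech-0.7.0a1-cp37-cp37m-linux_aarch64.whl
    pip3.7 install deepspeech-0.7.0a1-cp37-cp37m-linux_aarch64.whl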


Hi lissyx,

Sorry for the long post, and thank you in advance for your patience and help.

On my Xavier I tried the 0.7.0 DeepSpeech wheel file you pointed me to.
But I could not reproduce the improvement in memory consumption described in
hacks.mozilla.org/2019/12/deepspeech-0-6-mozillas-speech-to-text-engine

We now use 22 times less memory …

Details

What I did:

I prepared two different Docker environments to compare the performance of DeepSpeech
with TensorFlow (Python 3.6, someone’s release at github.com/domcross/DeepSpeech-for-Jetson-Nano/releases)
and
with TensorFlow Lite (Python 3.7, the wheel you provided).

For Python 3.6:
Since someone released DeepSpeech 0.6.0 for ARM64 at
github.com/domcross/DeepSpeech-for-Jetson-Nano/releases
1. I downloaded the DeepSpeech 0.6.0 wheel from this release, then ran pip3.6 install deepspeech-0.6.0-cp36-cp36m-linux_aarch64.whl
2. I downloaded the libdeepspeech.so file as well and put it in my search path.

For Python 3.7:
I installed all the requirements for DeepSpeech (0.7.0) and then installed
deepspeech-0.7.0a1-cp37-cp37m-linux_aarch64.whl


  • The information for my Xavier is:

root@DeepSpeech_v060:~# uname -a
Linux DeepSpeech_v060 4.9.140-tegra #1 SMP PREEMPT Tue Nov 5 13:37:19 PST 2019 aarch64 aarch64 aarch64 GNU/Linux

Software:
* Name: NVIDIA Jetson AGX Xavier
* Type: AGX Xavier
* Jetpack: UNKNOWN [L4T 32.2.3]
* GPU-Arch: 7.2
- Libraries:
* CUDA: 10.0.326
* cuDNN: 7.6.3.28-1+cuda10.0
* TensorRT: 6.0.1.10-1+cuda10.0
* VisionWorks: 1.6.0.500n
* OpenCV: 4.1.1 compiled CUDA: YES

  • The Docker information:

docker pull nvcr.io/nvidia/deepstream-l4t:4.0.2-19.12-samples


①22s to load
root@DeepSpeech_v060_lite:~# python3.7 mic_vad_wakeup_060_local.py -v 0 --model ./deepspeech-0.6.1-models/output_graph.tflite --lm ./deepspeech-0.6.1-models/lm.binary --trie ./deepspeech-0.5.1-models/trie

②144s to load
root@DeepSpeech_v060:~# python3.6 mic_vad_wakeup_060_local.py -v 0 --model ./deepspeech-0.6.1-models/output_graph.pbmm --lm ./deepspeech-0.6.1-models/lm.binary --trie ./deepspeech-0.5.1-models/trie

Though it did take much less time to load the TFLite model.
By “loading the model” I mean the time from executing ① or ② until seeing:

Listening (or press ctrl-c to exit)

  • The memory usage on the Xavier is as below:

free -m ↓ (unit: MB)
        total   used   free    shared  buff/cache  available
□ Mem:  15690   1500   12768   22      1420        13948      (before loading DeepSpeech)
① Mem:  15690   2457   11701   23      1530        12982      (deepspeech 0.7.0a1 wheel + TensorFlow Lite)
② Mem:  15690   2992   10939   22      1758        12451      (deepspeech 0.6.0 wheel + TensorFlow)
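A narrower check would be to sample the resident memory of just the inference process rather than system-wide free -m; a rough sketch (script name and flags taken from the commands above, the sleep duration is only a guess for the load time):

    # per-process resident memory instead of system-wide free -m
    python3.7 mic_vad_wakeup_060_local.py -v 0 \
        --model ./deepspeech-0.6.1-models/output_graph.tflite \
        --lm ./deepspeech-0.6.1-models/lm.binary \
        --trie ./deepspeech-0.5.1-models/trie &
    PID=$!
    sleep 60                       # wait until "Listening" appears
    grep VmRSS /proc/$PID/status   # resident set size of the Python process
    kill $PID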

Did I miss any step?
Also, could you please kindly explain the function of libdeepspeech.so?
If I only use Python, do I also need libdeepspeech.so?
On your 0.7.0 release page there is no specific libdeepspeech.so build for ARM64.
Does that mean there is no need to update libdeepspeech.so?

Sorry again for the long post, and thank you in advance for your patience and help.

So you are comparing using someone else’s release?

You can’t seriously rely on that for measuring memory usage.

What libdeepspeech.so are you referring to?

This is useless; we have ARM64 TensorFlow builds for 0.6.1 as well. Again, you can’t seriously compare using random sources like that.

Those figures hold. They were measured on desktop AMD64 builds, by analyzing memory allocation with valgrind massif. Please replicate that setup to verify them correctly.
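A minimal sketch of that kind of measurement against the C++ native client (paths are placeholders):

    # profile heap allocations of the native client itself, not the Python wrapper
    valgrind --tool=massif ./deepspeech \
        --model deepspeech-0.6.1-models/output_graph.pbmm \
        --lm deepspeech-0.6.1-models/lm.binary \
        --trie deepspeech-0.6.1-models/trie \
        --audio test.wav
    ms_print massif.out.<pid>      # inspect peak heap usage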

There is no “7.0”; there are only some 0.7.0 alpha builds, and they do have ARM64 builds …

Here: https://github.com/mozilla/DeepSpeech/releases/download/v0.7.0-alpha.1/deepspeech-0.7.0a1-cp37-cp37m-linux_aarch64.whl

Yes.

I will set up another Docker container, use the 0.6.1 or 0.6.0 wheel you provide, and compare again.

Oh the one in github.com/domcross/DeepSpeech-for-Jetson-Nano/releases

In my understanding, the Python binding (the .whl file?) is a wrapper that calls the compiled C library, libdeepspeech.so. Is it unnecessary for purely Python usage?
I can’t find libdeepspeech.so on your release page for ARM64.

Sorry for the typo, I did use the 0.7.0 alpha 1 version.

Thank you for your time.

Please don’t use 0.6.0 on TFLite, there was a bug.

I insist: we have 0.6.1 and a 0.7.0 alpha that is compatible with the 0.6.1 model, and 0.6.1 on ARM64 uses the TensorFlow runtime while the 0.7.0 alpha uses the TFLite runtime. There is no need to use any third-party build when we don’t know what they did.

There’s no magic; it’s all visible in the git tree: it’s using libdeepspeech.so. I don’t get the second part of your question.

What I see, though, is that you are mixing things and installing magic libraries in several places. In this context, it’s really hard to know for sure what you are running.

Make an effort? I linked it for you. The Python wheel packages the library; there’s no need for magic stuff.
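If you want to convince yourself, something like this should locate the bundled library inside the installed package (the exact layout may differ between releases, so this just searches for the file name):

    # find libdeepspeech.so shipped inside the installed deepspeech wheel
    PKG_DIR="$(python3 -c 'import deepspeech, os; print(os.path.dirname(deepspeech.__file__))')"
    find "$PKG_DIR" -name 'libdeepspeech.so'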

0.6.1 ARM64 with full TensorFlow runtime: https://github.com/mozilla/DeepSpeech/releases/download/v0.6.1/deepspeech-0.6.1-cp37-cp37m-linux_aarch64.whl
0.7.0-alpha.1 ARM64 with TFLite runtime: https://github.com/mozilla/DeepSpeech/releases/download/v0.7.0-alpha.1/deepspeech-0.7.0a1-cp37-cp37m-linux_aarch64.whl

Install both, compare.
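One clean way to do that, sketched under the assumption that both wheels sit in the current directory, is two separate virtualenvs so the runtimes cannot mix:

    # isolate the two runtimes in their own environments
    python3.7 -m venv ~/ds-tf   && ~/ds-tf/bin/pip   install deepspeech-0.6.1-cp37-cp37m-linux_aarch64.whl
    python3.7 -m venv ~/ds-lite && ~/ds-lite/bin/pip install deepspeech-0.7.0a1-cp37-cp37m-linux_aarch64.whl
    # then run the same audio through each and compare timing / memory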

But your comparison will be hugely biased since you get the Python runtime in the same … That’s why we measured with the C++ native client.

Thank you again for your patience.

Oh, I didn’t know that; I thought the .so and .whl files were totally independent,
since in the build instructions the .so file and the .whl file are built and installed separately.
And I thought that merely doing pip3 install on the .whl would not be enough to run the VAD sample (I thought I would also need to find/build a .so file for 0.7.0 and include it in the runtime path).

Sorry, I don’t understand this part: I get the Python runtime in the same __? Does this mean that because both the 0.6.1 and 0.7.0 DeepSpeech builds run under the Python runtime, the comparison result will be biased, even if they are run in two different Docker containers?

This is FLOSS; you can just verify instead of guessing. I know it might be non-trivial sometimes, but I’m always sad to see that people don’t check when they have everything they need, and keep a wrong idea in mind.

Yes, because they involve different tooling and constraints.

And yet we document everything. What is unclear, please, so we can improve?

If you run valgrind --tool=massif on the Python inference code, you will also measure the Python memory footprint, which might give you different usage values compared to libdeepspeech.so itself.

The bias should be the same for both Python runtimes, but since it’s an interpreted language it’s much harder to be certain. More critically, it will be biased relative to what you highlighted from the hacks.mozilla.org blog post.

I assume you want to see / measure the improvement we made, so it’s only fair and meaningful if you compare the way we did.
