Google's On-Device Speech Recognizer

As far as I recall, it was with the LM.

Another parameter was the way the microphone is accessed. It’s freshly merged and there is no release yet, but you can already use mozillaspeechlibrary with DeepSpeech:

And specifically, when working on that, I found that parameter to have a real impact:

@lissyx I was able to generate a quantized version of the TFLite model that is only 47.3 MB, and its inference results were close to those of the model before quantization, at least on the samples we had used to train the original network.

I have a question about the quantization approach that was adopted: does quantization only compress the model for storage, so that inference time stays the same as with the previous, bigger model?

Is there a plan to perform quantization aware training for DeepSpeech, so that the resulting inference is also faster and more accurate?
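To make the size-versus-speed question concrete, here is a minimal sketch of weight-only 8-bit quantization in plain Python. It is only an illustration of the general idea, not DeepSpeech's or TFLite's actual implementation, and the helper names (`quantize_weights`, `dequantize_weights`) are made up for this example. The weights are stored as one byte each, but they are expanded back to floats before the arithmetic, so storage shrinks while the math stays floating point.

```python
import struct


def quantize_weights(weights):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_weights(quantized, scale):
    """Expand the stored integers back to floats before doing any math."""
    return [q * scale for q in quantized]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy values, chosen only for illustration.
weights = [0.50, -1.00, 0.25, 0.75]
inputs = [1.0, 2.0, 3.0, 4.0]

quantized, scale = quantize_weights(weights)

# Storage: one byte per weight instead of four.
float_bytes = len(struct.pack(f"{len(weights)}f", *weights))
int8_bytes = len(struct.pack(f"{len(quantized)}b", *quantized))

# Inference: the weights are floats again, so the dot product is float math.
restored = dequantize_weights(quantized, scale)
exact = dot(weights, inputs)
approx = dot(restored, inputs)

print(float_bytes, int8_bytes, exact, approx)
```

Whether a given runtime also gets faster depends on whether it has integer kernels for the quantized ops; this sketch only shows the storage side.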

All the details are in

Not as of now. Performance is good enough for our current use-case priorities, and we have to optimize the language model, which is currently blocking us from having something complete.

What’s our plan for LM? Switching to a DL model from n-gram? Or maybe switch to Google’s unified RNN-T directly (that will be a huge change).

You pretty much nailed the options we’re exploring, but missed the conservative one: try to make the current system good enough.

Thanks for all your hard work everyone!

With the release of the Pi 4 with USB 3 showing big improvements both with and without the Coral USB Accelerator, I took another look at this today.

I can confirm @lissyx’s initial impressions from the online compiler (TensorFlow Lite inference) that the DeepSpeech TFLite model is rejected. I also ran the recently released offline compiler, which reported a more meaningful error: “Model not quantized”.

My understanding is limited, but I believe the reasoning is documented on this page (first blue note box):

Note: The Edge TPU does not support models built using post-training quantization, because although it creates a smaller model by using 8-bit values, it converts them back to 32-bit floats during inference. So you must use quantization-aware training, which uses “fake” quantization nodes to simulate the effect of 8-bit values during training, thus allowing inferences to run using the quantized values. This technique makes the model more tolerant of the lower precision values, which generally results in a higher accuracy model (compared to post-training quantization).
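For readers unfamiliar with the "fake" quantization nodes mentioned in that note, the sketch below shows the basic quantize-then-dequantize step in plain Python. It is a simplified illustration under my own assumptions: the function name `fake_quantize` and the [-1, 1] range are invented for the example, and real implementations also adjust the range so that zero is exactly representable. The output is still a float, but it can only take one of 256 values, which is how training gets exposed to 8-bit precision.

```python
def fake_quantize(x, range_min, range_max, num_bits=8):
    """Quantize then immediately dequantize a value.

    The result is a float restricted to 2**num_bits evenly spaced levels
    between range_min and range_max.
    """
    levels = 2 ** num_bits - 1
    scale = (range_max - range_min) / levels
    clamped = max(range_min, min(range_max, x))
    step = round((clamped - range_min) / scale)
    return range_min + step * scale


# Toy activations; values outside the range are clamped to its ends.
activations = [-1.5, -0.3, 0.0, 0.42, 2.0]
simulated = [fake_quantize(a, -1.0, 1.0) for a in activations]

for original, value in zip(activations, simulated):
    print(f"{original:+.4f} -> {value:+.4f}")
```

In-range values move by at most half a quantization step, which is the rounding error the network learns to tolerate during quantization-aware training.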

which I think @sranjeet.visteon was touching on in this thread.

Given the new potential for the Pi 4 + Edge TPU,
I’d be grateful if the devs could take another look at both the Pi 4 and Edge TPUs when considering future priorities.


Nice find, I failed to spot this one. Quantization-aware training was something I wanted to try, so that makes a good excuse.

Well, I did order two Pi 4s to play with, but the seller failed me and I’ll have to wait another four weeks before getting them …

But honestly, I’d prefer if we can get it working on Pi4 without the EdgeTPU.

Thanks lissyx, I’m glad I could help.

Yeah, completely understandable. In this context, I’d personally only invest in the USB Accelerator if it was the difference between having realtime performance or not (for the various smart home use cases in particular).

If the quantization-aware training works out and you want to give the new offline Edge TPU compiler a try for fun, this page should get you kickstarted:


This will sadly have to wait until the heat wave leaves (and I have some time available, obviously). I really can’t build or train anything right now: it’s 20:20 and still 37.5°C outside.

Woah, that’s crazy. No rush! Stay cool :slight_smile:

So, it looks like the situation has evolved: quantization-aware training is no longer required.

With that:

             converter = tf.lite.TFLiteConverter(frozen_graph, input_tensors=inputs.values(), output_tensors=outputs.values())
             converter.post_training_quantize = True
+            converter.inference_type = tf.lite.constants.QUANTIZED_UINT8
+            converter.quantized_input_stats = {
+                'input_samples': (0., 1.),
+                'Reshape': (0., 1.),
+                'previous_state_c': (0., 1.),
+                'previous_state_h': (0., 1.),
+            }
+            converter.default_ranges_stats = (0, 128)
             # AudioSpectrogram and Mfcc ops are custom but have built-in kernels in TFLite
             converter.allow_custom_ops = True
             tflite_model = converter.convert()
@@ -596,12 +596,6 @@ def test():
 def create_inference_graph(batch_size=1, n_steps=16, tflite=False):
     batch_size = batch_size if batch_size > 0 else None

-    # Create feature computation graph
-    input_samples = tfv1.placeholder(tf.float32, [Config.audio_window_samples], 'input_samples')
-    samples = tf.expand_dims(input_samples, -1)
-    mfccs, _ = samples_to_mfccs(samples, FLAGS.audio_sample_rate)
-    mfccs = tf.identity(mfccs, name='mfccs')
     # Input tensor will be of shape [batch_size, n_steps, 2*n_context+1, n_input]
     # This shape is read by the native_client in DS_CreateModel to know the
     # value of n_steps, n_context and n_input. Make sure you update the code

I can get the model converted for EdgeTPU. Now, we still need to take care of AudioSpectrogram and Mfcc operators on CPU.
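A note on the `quantized_input_stats` values in the diff above: as I understand the TensorFlow converter documentation, each `(mean, std_dev)` pair describes how a quantized uint8 input maps to a real value, via `real = (quantized - mean) / std_dev`. The sketch below only evaluates that formula in plain Python. The helper names are mine, and the `(128., 127.)` pair is a hypothetical alternative for comparison, not something taken from the DeepSpeech code.

```python
def dequantize_input(quantized_value, mean, std_dev):
    """real_input_value = (quantized_input_value - mean) / std_dev."""
    return (quantized_value - mean) / std_dev


def real_range(mean, std_dev):
    """Real-valued interval covered by the uint8 values 0..255."""
    return (dequantize_input(0, mean, std_dev),
            dequantize_input(255, mean, std_dev))


# Stats used in the diff: uint8 inputs are read as the real values 0..255.
identity_range = real_range(0.0, 1.0)

# Hypothetical stats (assumption): uint8 inputs cover roughly -1..1.
centered_range = real_range(128.0, 127.0)

print(identity_range)
print(centered_range)
```

So the stats should be chosen to match the real range of each input tensor. With `(0., 1.)`, a tensor whose real values are small or negative would not be represented well.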

Going further requires more work than I can put into this topic right now. It requires making libedgetpu available in the TensorFlow tree, then building and linking against it in the DeepSpeech codebase:

diff --git a/native_client/BUILD b/native_client/BUILD
index bf4e1d2..2b24047 100644
--- a/native_client/BUILD
+++ b/native_client/BUILD
@@ -71,7 +71,8 @@ tf_cc_shared_object(
-            "ds_graph_version.h"] +
+            "ds_graph_version.h",
+           "@libedgetpu//:edgetpu.h"] +
     copts = select({ 
         # -fvisibility=hidden is not required on Windows, MSCV hides all declarations by default
@@ -94,6 +95,7 @@ tf_cc_shared_object(
     deps = select({
         "//native_client:tflite": [
+            "@libedgetpu//:lib",
         "//conditions:default": [
diff --git a/native_client/ b/native_client/
index 526c176..3339ecb 100644
--- a/native_client/
+++ b/native_client/
@@ -22,6 +22,7 @@
 #else // USE_TFLITE
   #include "tensorflow/lite/model.h"
   #include "tensorflow/lite/kernels/register.h"
+  #include "edgetpu.h"
 #endif // USE_TFLITE
 #include "ctcdecode/ctc_beam_search_decoder.h"
@@ -725,14 +726,18 @@ DS_CreateModel(const char* aModelPath,
     return DS_ERR_FAIL_INIT_MMAP;
+  auto tpu_context = edgetpu::EdgeTpuManager::GetSingleton()->NewEdgeTpuContext();
   tflite::ops::builtin::BuiltinOpResolver resolver;
+  resolver.AddCustom(edgetpu::kCustomOp, edgetpu::RegisterCustomOp());
   tflite::InterpreterBuilder(*model->fbmodel, resolver)(&model->interpreter);
   if (!model->interpreter) {
     std::cerr << "Error at InterpreterBuilder for model file " << aModelPath << std::endl;
+  model->interpreter->SetExternalContext(kTfLiteEdgeTpuContext, tpu_context.get());
diff --git a/native_client/ b/native_client/
index da404a6..3563976 100644
--- a/native_client/
+++ b/native_client/
@@ -5,11 +5,11 @@ TFDIR     ?= $(abspath $(NC_DIR)/../../tensorflow)
 PREFIX    ?= /usr/local
 SO_SEARCH ?= $(TFDIR)/bazel-bin/
-TOOL_AS   := as
-TOOL_CC   := gcc
-TOOL_CXX  := c++
-TOOL_LD   := ld
-TOOL_LDD  := ldd
+TOOL_AS   ?= as
+TOOL_CC   ?= gcc
+TOOL_CXX  ?= c++
+TOOL_LD   ?= ld
+TOOL_LDD  ?= ldd
 DEEPSPEECH_BIN       := deepspeech


What remains: properly dealing with AudioSpectrogram/Mfcc on CPU and feeding the resulting values to the EdgeTPU device.
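One piece of that hand-off can be sketched in plain Python: once the features have been computed on the CPU as floats, they have to be converted to uint8 before a fully quantized model can consume them. This is only an illustration of the usual affine scheme, `q = round(x / scale) + zero_point`. The `scale` and `zero_point` values below are assumptions picked for the example. In practice they would come from the converted model's input tensor metadata. The feature values are toy numbers, not real MFCCs.

```python
def quantize_features(features, scale, zero_point):
    """Affine quantization of float features to uint8, clamped to [0, 255]."""
    return [max(0, min(255, round(x / scale) + zero_point)) for x in features]


# Assumed quantization parameters (hypothetical, for illustration only).
scale = 1.0 / 128.0
zero_point = 128

# Toy float features standing in for the output of the CPU feature stage.
features = [-1.0, 0.0, 0.5, 1.0]

quantized = quantize_features(features, scale, zero_point)

# Raw bytes, as they would be copied into a uint8 input buffer.
payload = bytes(quantized)

print(quantized, len(payload))
```

Note that 1.0 maps to 256 before clamping and ends up as 255, so values at the top of the range lose a little precision.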

EdgeTPU compiler output:

Output: output_graph_edgetpu.tflite

Operator                       Count      Status

SOFTMAX                        1          Mapped to Edge TPU
FULLY_CONNECTED                6          Mapped to Edge TPU
MINIMUM                        4          Mapped to Edge TPU
TANH                           2          Mapped to Edge TPU
STRIDED_SLICE                  1          Mapped to Edge TPU
LOGISTIC                       3          Mapped to Edge TPU
ADD                            2          Mapped to Edge TPU
SPLIT                          1          Mapped to Edge TPU
MUL                            3          Mapped to Edge TPU
CONCATENATION                  1          Mapped to Edge TPU
RESHAPE                        1          Mapped to Edge TPU
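If you want to check a compiler report like this automatically, for example to catch operators that fall back to the CPU, a small parser is enough. The sketch below assumes the three-column layout shown above (operator, count, status) and the status string "Mapped to Edge TPU" exactly as it appears there. Other compiler versions may format the report differently, and the function name is mine.

```python
COMPILER_LOG = """\
Operator                       Count      Status

SOFTMAX                        1          Mapped to Edge TPU
FULLY_CONNECTED                6          Mapped to Edge TPU
MINIMUM                        4          Mapped to Edge TPU
TANH                           2          Mapped to Edge TPU
STRIDED_SLICE                  1          Mapped to Edge TPU
LOGISTIC                       3          Mapped to Edge TPU
ADD                            2          Mapped to Edge TPU
SPLIT                          1          Mapped to Edge TPU
MUL                            3          Mapped to Edge TPU
CONCATENATION                  1          Mapped to Edge TPU
RESHAPE                        1          Mapped to Edge TPU
"""


def parse_operator_table(log_text):
    """Return (operator, count, status) tuples, skipping headers and blanks."""
    rows = []
    for line in log_text.splitlines():
        parts = line.split(None, 2)
        # Data rows have an integer in the second column; the header does not.
        if len(parts) == 3 and parts[1].isdigit():
            rows.append((parts[0], int(parts[1]), parts[2].strip()))
    return rows


rows = parse_operator_table(COMPILER_LOG)
unmapped = [name for name, _, status in rows if status != "Mapped to Edge TPU"]
total_ops = sum(count for _, count, _ in rows)

print(f"{len(rows)} operator types, {total_ops} ops, unmapped: {unmapped}")
```

For the report above, every listed operator is mapped, so `unmapped` is empty. The AudioSpectrogram and Mfcc ops do not appear in this table at all.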

Hello, has it been possible to test the tflite graph on the edgetpu? Would it be possible to get some steps to replicate this output or get access to your tflite output file? Thanks

Everything needed is here, and on GitHub …

I have attempted to reproduce your output using the DeepSpeech v0.5.1 branch and am getting this error when using v0.5.1 checkpoints:
ValueError: Quantization input stats are not available for input tensors 'input_node'.

Additionally, I have attempted to use the master branch using v0.5.1 checkpoints and am getting this error:
Key cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias not found in checkpoint [[node save/RestoreV2 (defined at ]]

Will I have to train a new model if I want to use the master branch since there are compatibility issues between the master branch and the provided checkpoints?

Well, I did work on top of master, so it’s not surprising …

Yes, you need newer checkpoints, we don’t support v0.5 checkpoints on current master.

But what do you want to achieve? I’ve already documented the state of running on EdgeTPU.

Thanks for the information.
Are there any pre-trained checkpoints available for v0.6?
I’m trying to see whether it is at all possible to get speech inference running on Raspberry Pis with an Edge TPU accelerator, to compare the speed against a desktop CPU.

Not yet.

What do you think I tried? :slight_smile: