Google's On-Device Speech Recognizer

@lissyx It is good to know that the DeepSpeech quantization effort beats Google's result! Is the 46 MB TFLite quantized model available as a pretrained model for testing as part of the release? If not, is there a procedure to generate a quantized model from checkpoints obtained through the default DeepSpeech training process?

It’s all documented in README files …

@lissyx There is an "Exporting a model for TFLite" section in the README, but it does not talk about quantization. We typically use this step to export the model to TFLite from the checkpoints that we trained, and it produces an acoustic model of almost 189 MB, close to the size of the .pbmm model.

Well, please read the code and report issues, because it’s working very well here. Quantization was enabled a few weeks ago at TFLite export.
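
Roughly, it boils down to switching on post-training quantization on the TFLite converter at export time. A minimal sketch of the idea with the TF 1.x API (the file and tensor names below are placeholders, not the exact ones from DeepSpeech.py):

    # Minimal sketch (TF 1.x-era API) of enabling post-training quantization
    # at TFLite export time. Graph file and tensor names are placeholders,
    # not the exact ones used by DeepSpeech.py.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_frozen_graph(
        graph_def_file='output_graph.pb',    # frozen acoustic model
        input_arrays=['input_node'],         # placeholder name, adjust to your graph
        output_arrays=['logits'])            # placeholder name
    converter.post_training_quantize = True  # store weights as 8-bit, ~4x smaller file
    converter.allow_custom_ops = True        # AudioSpectrogram/Mfcc have built-in TFLite kernels
    tflite_model = converter.convert()

    with open('output_graph.tflite', 'wb') as f:
        f.write(tflite_model)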

Just FYI. I can confirm the TFLite model works on Android. The problems I have right now are:

  1. The accuracy is not good enough. E.g. it will not correctly recognize “one two three four”. Not sure if it’s related to the model conversion. I’ll set up a server version for comparison.

  2. The LM is too huge. If we can get it down to a few hundred MB (or even less), it would be suitable to run on Android directly.

-rw-r--r-- 1 li li 213 Mar 14 11:05 ._alphabet.txt
-rw-r--r-- 1 li li 329 Mar 14 11:05 alphabet.txt
-rw-r--r-- 1 li li 213 Mar 14 11:05 ._lm.binary
-rw-r--r-- 1 li li 1.7G Mar 14 11:05 lm.binary
-rw-r--r-- 1 li li 181M Mar 14 11:05 output_graph.pb
-rw-r--r-- 1 li li 181M Mar 14 11:04 output_graph.pbmm
-rw-r--r-- 1 li li 181M Mar 14 11:04 output_graph.rounded.pb
-rw-r--r-- 1 li li 181M Mar 14 11:05 output_graph.rounded.pbmm
-rw-r--r-- 1 li li 46M Mar 14 11:05 output_graph.tflite
-rw-r--r-- 1 li li 21M Mar 14 11:05 trie
-rw-r--r-- 1 li li 213 Mar 14 11:05 ._trie

That depends on a lot of parameters. Our testing with native American English speakers gave good results, and as a non-native speaker myself I could get some things recognized pretty well. We measured the WER impact and found that TFLite prior to quantization was at ~8.5% WER, and with quantization it was ~10.2%.
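
For reference, WER here is the word error rate: the word-level edit distance between the reference transcript and the recognizer's hypothesis, divided by the number of reference words. A minimal sketch of the metric (illustrative only, not the project's evaluation code):

    # Rough sketch of the WER metric: word-level edit distance between the
    # reference and the hypothesis, divided by the number of reference words.
    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # Levenshtein distance over words, by dynamic programming
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + sub)  # substitution
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    print(wer('one two three four', 'one too three for'))  # 0.5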

Yes, we know. This still needs some work, which is why the code is in the repo but not widely advertised yet, and why there is no officially released .tflite file yet.

Is the 8.5% or 10.2% WER with or without the LM? Yes, I'm not a native speaker, which could be one of the reasons. Maybe I need to set up a standard test environment with WSJ or LibriSpeech.

As far as I recall, it was with the LM.

Another parameter was the way the microphone is accessed. It's freshly merged and there is no release yet, but you can use mozillaspeechlibrary with DeepSpeech now: https://github.com/mozilla/androidspeech/

And specifically, when working on that, I found that parameter to have a real impact: https://github.com/mozilla/androidspeech/commit/2bf0774519fa58249e214bfc34b72b1e742d50a1

@lissyx I was able to generate a quantized version of the TFLite model which is only 47.3 MB, and the inference results were close to the model before quantization, at least with the samples that we had used to train the original network.

I have a question about the quantization approach that was adopted: does the quantization only reduce the model size for storage, and will the inference time remain the same as with the previous/bigger model?

Is there a plan to perform quantization-aware training for DeepSpeech, so that the resulting inference is also faster and more accurate?

All the details are in https://github.com/mozilla/DeepSpeech/issues/1850

Not as of now. Performance is good enough for the current use-case priorities, and we have to optimize the language model, which is currently blocking us from having something complete.

What's our plan for the LM? Switching from n-gram to a DL model? Or maybe switching to Google's unified RNN-T directly (that would be a huge change)?

You pretty much nailed the options we’re exploring, but missed the conservative one: try to make the current system good enough.

Thanks for all your hard work everyone!

With the release of the Pi 4 with USB 3 showing big improvements both with and without the Coral USB Accelerator, I took another look at this today.

I can confirm @lissyx's initial impressions from the online compiler (TensorFlow Lite inference) that the DeepSpeech TFLite model is rejected. I also ran the recently released offline compiler, which reported a more meaningful error: "Model not quantized".

My understanding is limited, but I believe the reasoning is documented on this page (first blue note box):

Note: The Edge TPU does not support models built using post-training quantization, because although it creates a smaller model by using 8-bit values, it converts them back to 32-bit floats during inference. So you must use quantization-aware training, which uses “fake” quantization nodes to simulate the effect of 8-bit values during training, thus allowing inferences to run using the quantized values. This technique makes the model more tolerant of the lower precision values, which generally results in a higher accuracy model (compared to post-training quantization).

which I think @sranjeet.visteon was touching on in this thread.
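
For context, quantization-aware training in the TF 1.x world means rewriting the graph with "fake" quantization nodes via tf.contrib.quantize before training and again before export. A rough sketch of the general shape (the model-building helpers are hypothetical, and this is not DeepSpeech's actual training code):

    # Rough sketch of TF 1.x quantization-aware training as described in the
    # Edge TPU docs (tf.contrib.quantize). NOT DeepSpeech's training code;
    # build_model_and_loss / build_inference_model are hypothetical helpers.
    import tensorflow as tf

    train_graph = tf.Graph()
    with train_graph.as_default():
        loss = build_model_and_loss()  # hypothetical model construction
        # Rewrites the graph with "fake" quantization nodes so training
        # simulates 8-bit inference after quant_delay steps.
        tf.contrib.quantize.create_training_graph(input_graph=train_graph,
                                                  quant_delay=2000)
        train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)

    # At export time, rewrite the inference graph the same way before
    # freezing it and handing it to the TFLite converter.
    eval_graph = tf.Graph()
    with eval_graph.as_default():
        logits = build_inference_model()  # hypothetical
        tf.contrib.quantize.create_eval_graph(input_graph=eval_graph)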

Given the new potential for the Pi 4 + Edge TPU (https://blog.hackster.io/benchmarking-machine-learning-on-the-new-raspberry-pi-4-model-b-88db9304ce4), I'd be grateful if the devs could take another look at both the Pi 4 and Edge TPUs when considering future priorities.


Nice find, I had failed to find this one. Quantization-aware training was something I wanted to try, so this makes a good excuse.

Well, I did order two Pi 4s to play with, but kubii.fr failed me and I'll have to wait another four weeks before getting them …

But honestly, I'd prefer it if we could get it working on the Pi 4 without the Edge TPU.

Thanks lissyx, I’m glad I could help.

Yeah, completely understandable. In this context, I’d personally only invest in the USB Accelerator if it was the difference between having realtime performance or not (for the various smart home use cases in particular).

If the quantization-aware training works out and you want to give the new offline Edge TPU compiler a try for fun, this page should get you kickstarted:

Thanks!

This will sadly have to wait until the heat wave leaves (and I have some time available, obviously). I really can't build / train anything right now; it's 20:20 and still 37.5°C outside.

Woah, that’s crazy. No rush! Stay cool :slight_smile:

So, it looks like the situation has evolved: quantization-aware training is not required anymore.

With that:

             converter = tf.lite.TFLiteConverter(frozen_graph, input_tensors=inputs.values(), output_tensors=outputs.values())
             converter.post_training_quantize = True
+            converter.inference_type = tf.lite.constants.QUANTIZED_UINT8
+            converter.quantized_input_stats = {
+                'input_samples': (0., 1.),
+                'Reshape': (0., 1.),
+                'previous_state_c': (0., 1.),
+                'previous_state_h': (0., 1.),
+            }
+            converter.default_ranges_stats = (0, 128)
             # AudioSpectrogram and Mfcc ops are custom but have built-in kernels in TFLite
             converter.allow_custom_ops = True
             tflite_model = converter.convert()
@@ -596,12 +596,6 @@ def test():
 def create_inference_graph(batch_size=1, n_steps=16, tflite=False):
     batch_size = batch_size if batch_size > 0 else None

-    # Create feature computation graph
-    input_samples = tfv1.placeholder(tf.float32, [Config.audio_window_samples], 'input_samples')
-    samples = tf.expand_dims(input_samples, -1)
-    mfccs, _ = samples_to_mfccs(samples, FLAGS.audio_sample_rate)
-    mfccs = tf.identity(mfccs, name='mfccs')
-
     # Input tensor will be of shape [batch_size, n_steps, 2*n_context+1, n_input]
     # This shape is read by the native_client in DS_CreateModel to know the
     # value of n_steps, n_context and n_input. Make sure you update the code

I can get the model converted for the Edge TPU. Now we still need to take care of the AudioSpectrogram and Mfcc operators on the CPU.
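
One way to sanity-check the converted file is to inspect its tensors with the plain TFLite CPU interpreter (a sketch; the file name is assumed, and actually invoking the model this way may still fail if the AudioSpectrogram/Mfcc custom ops are not registered in that interpreter):

    # Inspect the converted model's input/output tensors to confirm the
    # uint8 quantization parameters took effect. CPU interpreter only,
    # not the Edge TPU runtime; file name assumed.
    import tensorflow as tf

    interpreter = tf.lite.Interpreter(model_path='output_graph.tflite')
    for detail in interpreter.get_input_details():
        print(detail['name'], detail['dtype'], detail['quantization'])
    for detail in interpreter.get_output_details():
        print(detail['name'], detail['dtype'], detail['quantization'])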