Google's On-Device Speech Recognizer

@lissyx It is good to know that the DeepSpeech quantization effort beats Google's result! Is the 46 MB TFLite quantized model available as a pretrained model for testing as part of the release? If not, is there a procedure to generate a quantized model from checkpoints obtained through the default DeepSpeech training process?

It’s all documented in README files …

@lissyx There is an "Exporting a model for TFLite" section in the README, but it does not talk about quantization. We typically use this step to export the model to TFLite from the checkpoints that we trained, and it produces an acoustic model of almost 189 MB, close to the size of the .pbmm model.

Well, please read the code and report issues, because it’s working very well here. Quantization was enabled a few weeks ago at TFLite export.
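
Roughly, it boils down to switching on post-training quantization on the TFLite converter at export time. A minimal sketch of the idea with the TF 1.x API (the file and tensor names below are placeholders, not the exact ones from DeepSpeech.py):

    # Minimal sketch (TF 1.x-era API) of enabling post-training quantization
    # at TFLite export time. Graph file and tensor names are placeholders,
    # not the exact ones used by DeepSpeech.py.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_frozen_graph(
        graph_def_file='output_graph.pb',    # frozen acoustic model
        input_arrays=['input_node'],         # placeholder name, adjust to your graph
        output_arrays=['logits'])            # placeholder name
    converter.post_training_quantize = True  # store weights as 8-bit, ~4x smaller file
    converter.allow_custom_ops = True        # AudioSpectrogram/Mfcc have built-in TFLite kernels
    tflite_model = converter.convert()

    with open('output_graph.tflite', 'wb') as f:
        f.write(tflite_model)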

Just FYI. I can confirm the TFLite model works on Android. The problems I have right now are:

  1. The accuracy is not good enough. E.g. it will not correctly recognize “one two three four”. Not sure if it’s related to the model conversion. I’ll set up a server version for comparison.

  2. The LM is too huge. If we can get it down to a few hundred MB (or even less), it would be suitable to run on Android directly.

-rw-r--r-- 1 li li 213 Mar 14 11:05 ._alphabet.txt
-rw-r--r-- 1 li li 329 Mar 14 11:05 alphabet.txt
-rw-r--r-- 1 li li 213 Mar 14 11:05 ._lm.binary
-rw-r--r-- 1 li li 1.7G Mar 14 11:05 lm.binary
-rw-r--r-- 1 li li 181M Mar 14 11:05 output_graph.pb
-rw-r--r-- 1 li li 181M Mar 14 11:04 output_graph.pbmm
-rw-r--r-- 1 li li 181M Mar 14 11:04 output_graph.rounded.pb
-rw-r--r-- 1 li li 181M Mar 14 11:05 output_graph.rounded.pbmm
-rw-r--r-- 1 li li 46M Mar 14 11:05 output_graph.tflite
-rw-r--r-- 1 li li 21M Mar 14 11:05 trie
-rw-r--r-- 1 li li 213 Mar 14 11:05 ._trie

That depends on a lot of parameters. Our testing with native American English speakers gave good results, and as a non-native speaker myself I could get some things recognized pretty well. We measured the WER impact and found that TFLite prior to quantization was at ~8.5% WER, and with quantization it was ~10.2%.
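
For reference, WER here is the word error rate: the word-level edit distance between the reference transcript and the recognizer's hypothesis, divided by the number of reference words. A minimal sketch of the metric (illustrative only, not the project's evaluation code):

    # Rough sketch of the WER metric: word-level edit distance between the
    # reference and the hypothesis, divided by the number of reference words.
    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # Levenshtein distance over words, by dynamic programming
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + sub)  # substitution
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    print(wer('one two three four', 'one too three for'))  # 0.5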

Yes, we know. This still needs some work, which is why the code is in the repo but not widely advertised yet, and why there is no officially released .tflite file yet.

Is the 8.5% or 10.2% WER with or without the LM? Yes, I'm not a native speaker, which could be one of the reasons. Maybe I need to set up a standard test environment with WSJ or LibriSpeech.

As far as I recall, it was with the LM.

Another parameter was the way the microphone is accessed. It's freshly merged and there is no release yet, but you can use mozillaspeechlibrary with DeepSpeech now: https://github.com/mozilla/androidspeech/

And specifically, when working on that, I found that parameter to have a real impact: https://github.com/mozilla/androidspeech/commit/2bf0774519fa58249e214bfc34b72b1e742d50a1

@lissyx I was able to generate a quantized version of the TFLite model which is only 47.3 MB, and the inference results were close to the model before quantization, at least with the samples that we had used to train the original network.

I have a question about the quantization approach that was adopted: does the quantization only reduce the model size for storage, and will the inference time remain the same as with the previous/bigger model?

Is there a plan to perform quantization-aware training for DeepSpeech, so that the resulting inference is also faster and more accurate?

All the details are in https://github.com/mozilla/DeepSpeech/issues/1850

Not as of now. Performance is good enough for the current use-case priorities, and we have to optimize the language model, which is currently blocking us from having something complete.

What's our plan for the LM? Switching from n-gram to a DL model? Or maybe switching to Google's unified RNN-T directly (that would be a huge change)?

You pretty much nailed the options we’re exploring, but missed the conservative one: try to make the current system good enough.

Thanks for all your hard work everyone!

With the release of the Pi 4 with USB 3 showing big improvements both with and without the Coral USB Accelerator, I took another look at this today.

I can confirm @lissyx's initial impressions from the online compiler (TensorFlow Lite inference) that the DeepSpeech TFLite model is rejected. I also ran the recently released offline compiler, which reported a more meaningful error: "Model not quantized".

My understanding is limited, but I believe the reasoning is documented on this page (first blue note box):

Note: The Edge TPU does not support models built using post-training quantization, because although it creates a smaller model by using 8-bit values, it converts them back to 32-bit floats during inference. So you must use quantization-aware training, which uses “fake” quantization nodes to simulate the effect of 8-bit values during training, thus allowing inferences to run using the quantized values. This technique makes the model more tolerant of the lower precision values, which generally results in a higher accuracy model (compared to post-training quantization).

which I think @sranjeet.visteon was touching on in this thread.
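
For context, quantization-aware training in the TF 1.x world means rewriting the graph with "fake" quantization nodes via tf.contrib.quantize before training and again before export. A rough sketch of the general shape (the model-building helpers are hypothetical, and this is not DeepSpeech's actual training code):

    # Rough sketch of TF 1.x quantization-aware training as described in the
    # Edge TPU docs (tf.contrib.quantize). NOT DeepSpeech's training code;
    # build_model_and_loss / build_inference_model are hypothetical helpers.
    import tensorflow as tf

    train_graph = tf.Graph()
    with train_graph.as_default():
        loss = build_model_and_loss()  # hypothetical model construction
        # Rewrites the graph with "fake" quantization nodes so training
        # simulates 8-bit inference after quant_delay steps.
        tf.contrib.quantize.create_training_graph(input_graph=train_graph,
                                                  quant_delay=2000)
        train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)

    # At export time, rewrite the inference graph the same way before
    # freezing it and handing it to the TFLite converter.
    eval_graph = tf.Graph()
    with eval_graph.as_default():
        logits = build_inference_model()  # hypothetical
        tf.contrib.quantize.create_eval_graph(input_graph=eval_graph)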

Given the new potential for the Pi 4 + Edge TPU (https://blog.hackster.io/benchmarking-machine-learning-on-the-new-raspberry-pi-4-model-b-88db9304ce4), I'd be grateful if the devs could take another look at both the Pi 4 and Edge TPUs when considering future priorities.


Nice find, I had failed to find this one. Quantization-aware training was something I wanted to try, so this makes a good excuse.

Well, I did order two Pi 4s to play with, but kubii.fr failed me and I'll have to wait another four weeks before getting them …

But honestly, I'd prefer it if we could get it working on the Pi 4 without the Edge TPU.

Thanks lissyx, I’m glad I could help.

Yeah, completely understandable. In this context, I’d personally only invest in the USB Accelerator if it was the difference between having realtime performance or not (for the various smart home use cases in particular).

If the quantization-aware training works out and you want to give the new offline Edge TPU compiler a try for fun, this page should get you kickstarted:

Thanks!

This will sadly have to wait until the heat wave leaves (and I have some time available, obviously). I really can't build / train anything right now; it's 20:20 and still 37.5°C outside.

Woah, that’s crazy. No rush! Stay cool :slight_smile:

So, it looks like the situation has evolved: quantization-aware training is not required anymore.

With that:

             converter = tf.lite.TFLiteConverter(frozen_graph, input_tensors=inputs.values(), output_tensors=outputs.values())
             converter.post_training_quantize = True
+            converter.inference_type = tf.lite.constants.QUANTIZED_UINT8
+            converter.quantized_input_stats = {
+                'input_samples': (0., 1.),
+                'Reshape': (0., 1.),
+                'previous_state_c': (0., 1.),
+                'previous_state_h': (0., 1.),
+            }
+            converter.default_ranges_stats = (0, 128)
             # AudioSpectrogram and Mfcc ops are custom but have built-in kernels in TFLite
             converter.allow_custom_ops = True
             tflite_model = converter.convert()
@@ -596,12 +596,6 @@ def test():
 def create_inference_graph(batch_size=1, n_steps=16, tflite=False):
     batch_size = batch_size if batch_size > 0 else None

-    # Create feature computation graph
-    input_samples = tfv1.placeholder(tf.float32, [Config.audio_window_samples], 'input_samples')
-    samples = tf.expand_dims(input_samples, -1)
-    mfccs, _ = samples_to_mfccs(samples, FLAGS.audio_sample_rate)
-    mfccs = tf.identity(mfccs, name='mfccs')
-
     # Input tensor will be of shape [batch_size, n_steps, 2*n_context+1, n_input]
     # This shape is read by the native_client in DS_CreateModel to know the
     # value of n_steps, n_context and n_input. Make sure you update the code

I can get the model converted for the Edge TPU. Now we still need to take care of the AudioSpectrogram and Mfcc operators on the CPU.
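
One way to sanity-check the converted file is to inspect its tensors with the plain TFLite CPU interpreter (a sketch; the file name is assumed, and actually invoking the model this way may still fail if the AudioSpectrogram/Mfcc custom ops are not registered in that interpreter):

    # Inspect the converted model's input/output tensors to confirm the
    # uint8 quantization parameters took effect. CPU interpreter only,
    # not the Edge TPU runtime; file name assumed.
    import tensorflow as tf

    interpreter = tf.lite.Interpreter(model_path='output_graph.tflite')
    for detail in interpreter.get_input_details():
        print(detail['name'], detail['dtype'], detail['quantization'])
    for detail in interpreter.get_output_details():
        print(detail['name'], detail['dtype'], detail['quantization'])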