Google's On-Device Speech Recognizer

nmstoker · March 13, 2019, 1:12pm

Hopefully this is inspiration rather than disheartening! It seems incredible they’ve got the model down to 80 MB!

lissyx · March 13, 2019, 1:14pm

Our quantized TFLite model is 46 MB and runs at ~ twice real time on Android, and it has been in our repo for a few weeks now. We still have work to do to get a good language model there; the current ones are too big.

nmstoker · March 13, 2019, 4:25pm

Excellent news - thanks for that info, I hadn’t realised and will take a look when I get back from work.

mayank.bhaskar · March 14, 2019, 11:52am

They have released the offline part for the Pixel devices on Android only. No news for the iOS counterpart yet.

Do you have any information about a standalone offline iOS model from DeepSpeech?

kdavis · March 14, 2019, 12:11pm

I just looked a bit more in detail at Google’s post.

Their claim of 80MB is for a compressed model. After zipping, Mozilla’s acoustic model is 23.2MB, much much smaller than Google’s 80MB.

(The original 1,2 MB I had here came from the size column of the Finder, which, confusingly, showed only the amount currently uploaded to iCloud Drive, not the size of the file.)
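
For anyone who wants to repeat this kind of comparison on their own files, here is a minimal sketch. It assumes plain DEFLATE zip compression; the post does not say which tool or settings were used for the 23.2MB figure, and `zipped_size` is a hypothetical helper name, not part of DeepSpeech.

```python
import os
import tempfile
import zipfile


def zipped_size(path):
    """Return (raw_bytes, zipped_bytes) for a single file.

    Assumption: standard DEFLATE compression, as a typical .zip would use.
    """
    raw = os.path.getsize(path)
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "model.zip")
        with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            # Store only the file name, not the full directory path.
            zf.write(path, arcname=os.path.basename(path))
        return raw, os.path.getsize(archive)
```

Calling `zipped_size("output_graph.tflite")` on an exported model (the file name is only an example) gives both numbers in bytes, which avoids reading a misleading size column in a file browser.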

lissyx · March 14, 2019, 12:43pm

We don’t have plans for that as of now.

mayank.bhaskar · March 14, 2019, 12:55pm

Could you please redirect me to someone who is working on the iOS standalone part?
If you have interest or future plans for that, I have some ideas on how it could be implemented and optimised on the iOS platform.
I may be wrong, but it could help the entire community and, in return, Mozilla’s open-source DeepSpeech project.

lissyx · March 14, 2019, 12:56pm

As I said, we have no plans to work on iOS support for now; we don’t have enough resources to take care of that. If you are able to get it running, feel free to fork and send patches.

sranjeet.visteon · March 18, 2019, 2:08pm

@lissyx It is good to know that the DeepSpeech quantization effort beats Google’s result! Is the 46 MB quantized TFLite model available as a pretrained model for testing as part of the release? If not, is there a procedure to generate a quantized model from the checkpoints produced by the default DeepSpeech training process?

lissyx · March 18, 2019, 2:12pm

It’s all documented in README files …

sranjeet.visteon · March 18, 2019, 2:28pm

@lissyx There is an "Exporting a model for TFLite" section in the README, but it does not mention quantization. We typically use this step to export the model to TFLite from the checkpoints we trained, and it produces an acoustic model of almost 189 MB, which is close to the size of the .pbmm model.

lissyx · March 18, 2019, 2:31pm

Well, please read the code and report issues, because it’s working very well here. Quantization was enabled a few weeks ago at TFLite export.

lissyx · March 18, 2019, 3:40pm

eggonlea · March 18, 2019, 6:19pm

Just FYI. I can confirm the TFLite model works on Android. The problems I have right now are:

1. The accuracy is not good enough. E.g. it will not correctly recognize “one two three four”. I’m not sure whether that is related to the model conversion. I’ll set up a server version for comparison.
2. The LM is too big. If we could get it down to a few hundred MB (or even less), it would be suitable to run on Android directly.

-rw-r--r-- 1 li li 213 Mar 14 11:05 ._alphabet.txt
-rw-r--r-- 1 li li 329 Mar 14 11:05 alphabet.txt
-rw-r--r-- 1 li li 213 Mar 14 11:05 ._lm.binary
-rw-r--r-- 1 li li 1.7G Mar 14 11:05 lm.binary
-rw-r--r-- 1 li li 181M Mar 14 11:05 output_graph.pb
-rw-r--r-- 1 li li 181M Mar 14 11:04 output_graph.pbmm
-rw-r--r-- 1 li li 181M Mar 14 11:04 output_graph.rounded.pb
-rw-r--r-- 1 li li 181M Mar 14 11:05 output_graph.rounded.pbmm
-rw-r--r-- 1 li li 46M Mar 14 11:05 output_graph.tflite
-rw-r--r-- 1 li li 21M Mar 14 11:05 trie
-rw-r--r-- 1 li li 213 Mar 14 11:05 ._trie

lissyx · March 18, 2019, 6:23pm

That depends on a lot of parameters. Our testing with native speakers of American English gave good results, and even though I am not a native speaker myself, I could get some things recognized pretty well. We measured the WER impact: TFLite before quantization was at ~8.5% WER, and with quantization it was at ~10.2%.

Yes, we know. This still needs some work, which is why the code is in the repo but not widely advertised yet, and why no .tflite file has been officially released yet.

eggonlea · March 18, 2019, 8:13pm

Is the 8.5% or 10.2% WER with or without the LM? Yes, I’m not a native speaker, which could be one of the reasons. Maybe I need to set up a standard test environment with WSJ or LibriSpeech.

lissyx · March 18, 2019, 8:19pm

As far as I recall, it was with the LM.
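
For readers setting up their own comparison, the WER figures discussed above can be reproduced in principle with a word-level edit distance. The following is a minimal sketch of the standard definition (substitutions + deletions + insertions, divided by the number of reference words). It is not DeepSpeech's own evaluation code, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from i reference words to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

With the phrase mentioned earlier in the thread, `word_error_rate("one two three four", "one too three")` counts one substitution and one deletion over four reference words, i.e. 0.5. Averaged over a fixed test set such as LibriSpeech, this is the kind of number that can be compared before and after quantization, with and without the LM.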

lissyx · March 18, 2019, 10:12pm

Another parameter is the way the microphone is accessed. It’s freshly merged and not released yet, but you can now use mozillaspeechlibrary with DeepSpeech: GitHub - mozilla/androidspeech: DEPRECATED - An Android library module to Mozilla's Speech-To-Text services

Specifically, when working on that, I found that this parameter has a real impact: Switch to VOICE_RECOGNITION audio source · mozilla/androidspeech@2bf0774 · GitHub

sranjeet.visteon · March 25, 2019, 4:26pm

@lissyx I was able to generate a quantized TFLite model that is only 47.3 MB, and its inference results were close to those of the model before quantization, at least on the samples we had used to train the original network.

I have a question about the quantization approach that was adopted: does quantization only compress the model for storage, so that inference takes the same time as with the previous, bigger model?

Is there a plan to do quantization-aware training for DeepSpeech, so that the resulting inference is also faster and more accurate?
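
As background for the question: the sizes reported in this thread (almost 189 MB unquantized, about 47 MB quantized) are roughly a 4x reduction, which is what storing 32-bit float weights as 8-bit integers would give. The sketch below is a generic illustration of 8-bit weight quantization in plain Python, not the TFLite converter's or DeepSpeech's actual implementation. The function names are hypothetical.

```python
import array


def quantize_uint8(weights):
    """Map floats to 0..255 with one scale and offset for the whole tensor."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale if all weights are equal
    q = array.array("B", (round((w - lo) / scale) for w in weights))
    return q, scale, lo


def dequantize(q, scale, lo):
    """Recover approximate float values from the 8-bit codes."""
    return [lo + scale * v for v in q]


def storage_bytes(values, typecode):
    """Bytes needed to store the values with the given array typecode."""
    arr = array.array(typecode, values)
    return len(arr) * arr.itemsize
```

Here `storage_bytes(weights, "f")` is four times `len(quantize_uint8(weights)[0])`, and each restored weight is off by at most half a quantization step (`scale / 2`). Whether inference also gets faster depends on whether the runtime computes directly on the 8-bit values or converts them back to floats first, so a smaller file does not by itself say anything about speed.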

lissyx · March 25, 2019, 9:13pm

All the details are in TFLite + Quantization · Issue #1850 · mozilla/DeepSpeech · GitHub

Not as of now. Performance is good enough for the current priority use cases, and we first have to optimize the language model, which is currently blocking us from having something complete.
