Google's On-Device Speech Recognizer

Hopefully this is inspiring rather than disheartening! It seems incredible they’ve got the model down to 80 MB!


Our quantized TFLite model is 46MB and runs at roughly twice real time on Android, and it has been in our repo for a few weeks now. We still have work to do to get a good language model there; the current ones are too big.
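For anyone wanting to check the real-time factor themselves, here is a minimal sketch using the deepspeech Python bindings of that era; the constructor arguments (feature count, context size, beam width) and the file names are assumptions, since they changed between the 0.x releases.

    # Rough real-time-factor check with the deepspeech Python bindings
    # (0.x-era API; constructor arguments and file names are assumptions).
    import time
    import wave
    import numpy as np
    from deepspeech import Model

    N_FEATURES, N_CONTEXT, BEAM_WIDTH = 26, 9, 500  # era defaults (assumption)
    ds = Model('output_graph.tflite', N_FEATURES, N_CONTEXT, 'alphabet.txt', BEAM_WIDTH)

    w = wave.open('test.wav', 'rb')
    rate = w.getframerate()
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    w.close()

    start = time.time()
    print(ds.stt(audio, rate))
    rtf = (time.time() - start) / (len(audio) / float(rate))
    print('real-time factor: %.2f' % rtf)  # ~0.5 would match "twice real time"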


Excellent news - thanks for that info, I hadn’t realised and will take a look when I get back from work.

They have released the offline part for the Pixel devices on Android only. No news for the iOS counterpart yet.

Do you have any info about a standalone offline iOS model from DeepSpeech?

I just looked a bit more in detail at Google’s post.

Their claim of 80MB is for a compressed model. After zipping, Mozilla’s acoustic model is 23.2MB, much smaller than Google’s 80MB.

(The original 1.2 MB I had here came from the size column in the Finder, but that was, confusingly, only the size currently uploaded to iCloud Drive, not the size of the file.)
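For an apples-to-apples comparison with Google’s compressed figure, a quick check along these lines reproduces such numbers; the file name is just an example.

    # Compare on-disk vs. gzip-compressed model size (file name is an example).
    import gzip
    import os
    import shutil

    model = 'output_graph.tflite'
    with open(model, 'rb') as src, gzip.open(model + '.gz', 'wb') as dst:
        shutil.copyfileobj(src, dst)

    for f in (model, model + '.gz'):
        print('%s: %.1f MB' % (f, os.path.getsize(f) / (1024.0 * 1024)))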


We don’t have plans for that as of now.

Could you please redirect me to someone who is working on the iOS standalone part?
If you have any interest or future plans there, I have some ideas about how this could be optimised and implemented on the iOS platform.
I may be wrong, but it could help the entire community and benefit Mozilla’s DeepSpeech open-source project in return.

As I said, we have no plans to work on iOS support for now; we don’t have enough resources to take care of that. If you are able to get it running, feel free to fork and send patches.


@lissyx It is good to know that the DeepSpeech quantization effort beats Google’s result! Is the 46MB quantized TFLite model available as a pretrained model for testing as part of the release? If not, is there a procedure to generate a quantized model from the checkpoints obtained with the default DeepSpeech training process?

It’s all documented in README files …

@lissyx There is an “Exporting a model for TFLite” section in the README, but it does not talk about quantization. We typically use that step to export the model to TFLite from the checkpoints we trained, and it produces an acoustic model of almost 189MB, which is close to the size of the .pbmm model.

Well, please read the code and report issues, because it’s working very well here. Quantization was enabled in the TFLite export a few weeks ago.
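For readers unfamiliar with what that export does: it relies on TensorFlow 1.x post-training quantization, which stores the weights as 8-bit integers. The sketch below only illustrates that generic mechanism; it is not the DeepSpeech export script itself, and the graph file and node names are placeholders.

    # Generic TensorFlow 1.x post-training quantization sketch; NOT the
    # DeepSpeech export code. Graph file and node names are placeholders.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_frozen_graph(
        'output_graph.pb',
        input_arrays=['input_node'],   # placeholder
        output_arrays=['logits'])      # placeholder
    converter.post_training_quantize = True  # store weights as 8-bit
    with open('output_graph.tflite', 'wb') as f:
        f.write(converter.convert())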

Just FYI. I can confirm the TFLite model works on Android. The problems I have right now are:

  1. The accuracy is not good enough. E.g. it will not correctly recognize “one two three four”. Not sure if it’s related to the model conversion. I’ll set up a server version for comparison.

  2. The LM is too huge. If we can get it down to a few hundred MB (or even less), it would be suitable to run on Android directly.

-rw-r--r-- 1 li li 213 Mar 14 11:05 ._alphabet.txt
-rw-r--r-- 1 li li 329 Mar 14 11:05 alphabet.txt
-rw-r--r-- 1 li li 213 Mar 14 11:05 ._lm.binary
-rw-r--r-- 1 li li 1.7G Mar 14 11:05 lm.binary
-rw-r--r-- 1 li li 181M Mar 14 11:05 output_graph.pb
-rw-r--r-- 1 li li 181M Mar 14 11:04 output_graph.pbmm
-rw-r--r-- 1 li li 181M Mar 14 11:04 output_graph.rounded.pb
-rw-r--r-- 1 li li 181M Mar 14 11:05 output_graph.rounded.pbmm
-rw-r--r-- 1 li li 46M Mar 14 11:05 output_graph.tflite
-rw-r--r-- 1 li li 21M Mar 14 11:05 trie
-rw-r--r-- 1 li li 213 Mar 14 11:05 ._trie

That depends on a lot of parameters. Our testing with native speakers of American English gave good results, and even as a non-native speaker myself I could get some things recognized pretty well. We measured the WER impact and found that TFLite prior to quantization was at ~8.5% WER and with quantization at ~10.2%.

Yes, we know; this still needs some work, and that is why the code is in the repo but not advertised much yet, and why there is no officially released .tflite file yet.
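For reference, the WER quoted above is the word-level edit distance between reference and hypothesis, divided by the number of reference words; a minimal sketch:

    # Minimal word error rate: word-level Levenshtein distance divided by
    # the number of reference words.
    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / float(len(ref))

    print(wer('one two three four', 'one too three for'))  # 0.5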

Is the 8.5% or 10.2% WER with or without the LM? Yes, I’m not a native speaker, which could be one of the reasons. Maybe I need to set up a standard test environment with WSJ or LibriSpeech.

As far as I recall, it was with the LM.

Another parameter is the way the microphone is accessed. It’s freshly merged and there is no release yet, but you can use mozillaspeechlibrary with DeepSpeech now: https://github.com/mozilla/androidspeech/

And specifically, when working on that, I found that parameter to have a real impact: https://github.com/mozilla/androidspeech/commit/2bf0774519fa58249e214bfc34b72b1e742d50a1

@lissyx I was able to generate a quantized TFLite model that is only 47.3 MB, and the inference results were close to those of the model before quantization, at least on the samples we had used to train the original network.

I had a question about the quantization approach adopted: does the quantization only shrink the model size for storage, or is the resulting inference time also going to differ from the previous, bigger model?

Is there a plan to perform quantization-aware training for DeepSpeech, so that the resulting inference is also faster and more accurate?
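For context, quantization-aware training in TensorFlow 1.x means rewriting the training graph with fake-quantization ops so the weights learn to tolerate 8-bit precision. The sketch below only illustrates the generic contrib.quantize mechanism on a toy two-layer model; it is not anything wired into DeepSpeech.

    # Toy illustration of TF 1.x quantization-aware training via contrib.quantize;
    # the model here is a stand-in, not the DeepSpeech architecture.
    import tensorflow as tf

    g = tf.Graph()
    with g.as_default():
        x = tf.placeholder(tf.float32, [None, 26], name='features')
        y = tf.placeholder(tf.float32, [None, 29], name='labels')
        logits = tf.layers.dense(tf.layers.dense(x, 2048, tf.nn.relu), 29)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))

        # Rewrites the graph in place, inserting fake-quant ops after 2000 steps.
        tf.contrib.quantize.create_training_graph(input_graph=g, quant_delay=2000)
        train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)

        # For export, the matching rewrite would be:
        # tf.contrib.quantize.create_eval_graph(input_graph=g)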

All the details are in https://github.com/mozilla/DeepSpeech/issues/1850

Not as of now; performance is good enough for the current use-case priorities, and we have to optimize the language model, which is what currently blocks us from having something complete.