Hi,

I would like to discuss model compression here. Correct me if I am wrong.

So far, there is only TFLite quantization (FP32 -> 8-bit fixed point, a 4x reduction). I want to compress the model further, to roughly a 50x~100x reduction. The plan is outlined below.

- Related guide
- Example that compressed DeepSpeech
- Google optimizes their streaming RNN-T

**Step 1. Pruning**

Pruning is a popular compression method. Structured pruning (pruning whole neurons or channels) is preferred because weight pruning (just setting low-value weights to zero) can't actually reduce memory consumption (though weight pruning does increase speed). I wonder if DeepSpeech supports structured pruning.

This paper shows that a large sparse network is consistently better than a small dense network. I would rather train a large network with structured pruning than just tune `n_hidden`. However, this means I may have to modify the training architecture and procedure.

**Expected reduction ratio: 2~3**
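To make the structured-pruning idea concrete, here is a minimal NumPy sketch (the function name `prune_units` and the L2-norm scoring criterion are my own choices, not DeepSpeech's): whole hidden units are ranked and removed, so both weight matrices actually shrink, unlike weight pruning where the matrix shape stays the same.

```python
import numpy as np

def prune_units(W_in, W_out, keep_ratio=0.5):
    """Structured-pruning sketch: remove whole hidden units.

    W_in  : (n_in, n_hidden)  weights into the hidden layer
    W_out : (n_hidden, n_out) weights out of the hidden layer

    Each unit is scored by the L2 norm of its outgoing weights;
    the lowest-scoring units are dropped entirely, shrinking both
    matrices (real memory savings, unlike zeroing single weights).
    """
    n_hidden = W_in.shape[1]
    n_keep = max(1, int(n_hidden * keep_ratio))
    scores = np.linalg.norm(W_out, axis=1)        # one score per unit
    keep = np.sort(np.argsort(scores)[-n_keep:])  # indices of survivors
    return W_in[:, keep], W_out[keep, :]

W_in = np.random.randn(100, 64)
W_out = np.random.randn(64, 29)
W_in_p, W_out_p = prune_units(W_in, W_out, keep_ratio=0.5)
print(W_in_p.shape, W_out_p.shape)  # (100, 32) (32, 29)
```

In a real network this would be done during (or followed by) retraining to recover accuracy; the sketch only shows why structured pruning gives a genuine reduction ratio.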

**Step 2. Low rank approximation/factorization**

I haven't surveyed this part very thoroughly. The basic idea is to transform an M x N weight matrix into an M x K matrix times a K x N matrix. Sparsity is an important factor here. My plan is to apply conventional SVD.

**Expected reduction ratio: 3~5**
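The SVD plan can be sketched in a few lines of NumPy (the helper name `low_rank` is mine). The parameter count drops from M*N to K*(M+N), so the rank K directly controls the reduction ratio:

```python
import numpy as np

def low_rank(W, k):
    """Factor an M x N matrix into (M x k) @ (k x N) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # M x k (singular values folded into U)
    B = Vt[:k, :]          # k x N
    return A, B

M, N, k = 512, 512, 64
W = np.random.randn(M, N)
A, B = low_rank(W, k)
orig = M * N            # parameters before factorization
fact = M * k + k * N    # parameters after factorization
print(f"params: {orig} -> {fact}  ({orig / fact:.1f}x reduction)")
# params: 262144 -> 65536  (4.0x reduction)
```

At inference time the layer computes `x @ A @ B` instead of `x @ W`; the approximation error depends on how fast the singular values of the real weight matrices decay, which is why this usually needs a quick fine-tune afterwards.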

**Step 3. Quantization**

While TFLite can only quantize weights to 8-bit, I would try other methods. Methods that don't require modifying back-propagation or the model architecture are preferred; in other words, I would only use post-training quantization. Symmetric/asymmetric quantization doesn't work for low-bit quantization (<8 bits). Traditional k-means clustering performs well consistently. Multi-bit quantization has been successful for RNNs. Data-free quantization can be applied without retraining. OMSE and greedy approximation directly minimize the mean squared error to achieve good performance.

**Expected reduction ratio: 5~8 (32-bit down to 6~4 bits)**
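As an illustration of the k-means option, here is a hedged post-training sketch (function name `kmeans_quantize`, quantile initialization, and the Lloyd-iteration count are my own assumptions): the weights are clustered into a 2^n_bits-entry codebook, and only the codebook plus one small integer index per weight needs to be stored.

```python
import numpy as np

def kmeans_quantize(w, n_bits=4, n_iter=20):
    """Post-training k-means quantization sketch.

    Clusters the weights into 2**n_bits centroids (the codebook) and
    returns the codebook plus an integer index per weight.
    """
    k = 2 ** n_bits
    flat = w.ravel()
    # initialize centroids at evenly spaced quantiles of the weights
    centroids = np.quantile(flat, np.linspace(0.0, 1.0, k))
    for _ in range(n_iter):  # Lloyd's algorithm on 1-D data
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            members = flat[idx == j]
            if members.size:
                centroids[j] = members.mean()
    idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return centroids, idx.reshape(w.shape).astype(np.uint8)

w = np.random.randn(256, 256).astype(np.float32)
codebook, idx = kmeans_quantize(w, n_bits=4)
w_hat = codebook[idx]  # dequantized weights for inference
print("codebook size:", codebook.size)  # 16 entries for 4 bits
print("mean abs error:", np.abs(w - w_hat).mean())
```

This matches the "no back-prop changes" constraint above: it runs on a trained checkpoint, and the accuracy impact can be measured by simply running inference with `w_hat`.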

Knowledge distillation and dedicated structures are not discussed here since they usually only work for specific models.

**Expected total reduction ratio: 30~120**
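For clarity, the total range above is just the product of the per-step ranges:

```python
# Per-step expected reduction ratios (low, high) from the plan above
pruning = (2, 3)    # Step 1: structured pruning
low_rank = (3, 5)   # Step 2: low-rank factorization
quant = (5, 8)      # Step 3: quantization

low = pruning[0] * low_rank[0] * quant[0]
high = pruning[1] * low_rank[1] * quant[1]
print(f"total reduction: {low}x ~ {high}x")  # 30x ~ 120x
```

This assumes the steps compose multiplicatively, which is optimistic: in practice each step eats into the redundancy the next one would exploit, so the realized total is often below the naive product.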

Weight pruning and Steps 2 and 3 can be done after training: I can literally get every weight from every neuron, transform it, and check the performance by running inference. Retraining may be needed, but that's not a big problem. I am discussing this here mainly to seek more advice, e.g. whether it will work for RNN/LSTM or, more specifically, end-to-end ASR.

Step 1 is more complicated since I have to do structured pruning while training. That means working through the whole training architecture and procedure, which is a task I don't really want to do. Fortunately, **pruning is built on TensorFlow**, so I would like to ask for some help here. That said, any advice and thoughts about my whole procedure, or about other compression methods, would be appreciated.

My little advice: **size means feasibility**. DeepSpeech should release some basic compression functionality (and if it already exists, at least publish a recipe/guide). This would give both the research team and users much more room to develop.

Thanks!