Hi,

I would like to discuss model compression here. Correct me if I am wrong.

So far, there is only TFLite quantization (FP32 -> 8-bit fixed point, a 4x reduction). I want to compress the model further, to roughly a 50x~100x reduction. The plan is outlined below.

- Related guide
- Example that compressed DeepSpeech
- Google optimizes their streaming RNN-T

**Step 1. Pruning**

Pruning is a popular compression method. Structured pruning (pruning whole neurons or channels) is preferred because weight pruning (just setting low-value weights to zero) can't actually reduce memory consumption (though weight pruning does increase speed). I wonder if DeepSpeech supports structured pruning.

This paper shows that a large sparse network is consistently better than a small dense network. I would rather train a large network with structured pruning than just tune `n_hidden`. However, this means I may have to modify the training architecture and procedure.

**Expected reduction ratio: 2~3**
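To make the structured-pruning idea concrete, here is a minimal NumPy sketch (the function name `prune_units` and the L2-norm scoring criterion are my own choices, not DeepSpeech's): whole hidden units are ranked and removed, so both weight matrices actually shrink, unlike weight pruning where the matrix shape stays the same.

```python
import numpy as np

def prune_units(W_in, W_out, keep_ratio=0.5):
    """Structured-pruning sketch: remove whole hidden units.

    W_in  : (n_in, n_hidden)  weights into the hidden layer
    W_out : (n_hidden, n_out) weights out of the hidden layer

    Each unit is scored by the L2 norm of its outgoing weights;
    the lowest-scoring units are dropped entirely, shrinking both
    matrices (real memory savings, unlike zeroing single weights).
    """
    n_hidden = W_in.shape[1]
    n_keep = max(1, int(n_hidden * keep_ratio))
    scores = np.linalg.norm(W_out, axis=1)        # one score per unit
    keep = np.sort(np.argsort(scores)[-n_keep:])  # indices of survivors
    return W_in[:, keep], W_out[keep, :]

W_in = np.random.randn(100, 64)
W_out = np.random.randn(64, 29)
W_in_p, W_out_p = prune_units(W_in, W_out, keep_ratio=0.5)
print(W_in_p.shape, W_out_p.shape)  # (100, 32) (32, 29)
```

In a real network this would be done during (or followed by) retraining to recover accuracy; the sketch only shows why structured pruning gives a genuine reduction ratio.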

**Step 2. Low rank approximation/factorization**

I haven't surveyed this part very thoroughly. The basic idea is to transform an M x N weight matrix into an M x K matrix times a K x N matrix. Sparsity is an important factor here. My plan is to apply conventional SVD.

**Expected reduction ratio: 3~5**
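The SVD plan can be sketched in a few lines of NumPy (the helper name `low_rank` is mine). The parameter count drops from M*N to K*(M+N), so the rank K directly controls the reduction ratio:

```python
import numpy as np

def low_rank(W, k):
    """Factor an M x N matrix into (M x k) @ (k x N) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # M x k (singular values folded into U)
    B = Vt[:k, :]          # k x N
    return A, B

M, N, k = 512, 512, 64
W = np.random.randn(M, N)
A, B = low_rank(W, k)
orig = M * N            # parameters before factorization
fact = M * k + k * N    # parameters after factorization
print(f"params: {orig} -> {fact}  ({orig / fact:.1f}x reduction)")
# params: 262144 -> 65536  (4.0x reduction)
```

At inference time the layer computes `x @ A @ B` instead of `x @ W`; the approximation error depends on how fast the singular values of the real weight matrices decay, which is why this usually needs a quick fine-tune afterwards.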

**Step 3. Quantization**

While TFLite can only quantize weights to 8-bit, I would try other methods. Methods that don't require modifying back-propagation or the model architecture are preferred; in other words, I would only use post-training quantization. Symmetric/asymmetric quantization doesn't work for low-bit quantization (<8 bits). Traditional k-means clustering performs well consistently. Multi-bit quantization has been successful for RNNs. Data-free quantization can be applied without retraining. OMSE and greedy approximation directly minimize the mean squared error to achieve good performance.

**Expected reduction ratio: 5~8 (32-bit down to 6~4 bits)**
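As an illustration of the k-means option, here is a hedged post-training sketch (function name `kmeans_quantize`, quantile initialization, and the Lloyd-iteration count are my own assumptions): the weights are clustered into a 2^n_bits-entry codebook, and only the codebook plus one small integer index per weight needs to be stored.

```python
import numpy as np

def kmeans_quantize(w, n_bits=4, n_iter=20):
    """Post-training k-means quantization sketch.

    Clusters the weights into 2**n_bits centroids (the codebook) and
    returns the codebook plus an integer index per weight.
    """
    k = 2 ** n_bits
    flat = w.ravel()
    # initialize centroids at evenly spaced quantiles of the weights
    centroids = np.quantile(flat, np.linspace(0.0, 1.0, k))
    for _ in range(n_iter):  # Lloyd's algorithm on 1-D data
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            members = flat[idx == j]
            if members.size:
                centroids[j] = members.mean()
    idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return centroids, idx.reshape(w.shape).astype(np.uint8)

w = np.random.randn(256, 256).astype(np.float32)
codebook, idx = kmeans_quantize(w, n_bits=4)
w_hat = codebook[idx]  # dequantized weights for inference
print("codebook size:", codebook.size)  # 16 entries for 4 bits
print("mean abs error:", np.abs(w - w_hat).mean())
```

This matches the "no back-prop changes" constraint above: it runs on a trained checkpoint, and the accuracy impact can be measured by simply running inference with `w_hat`.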

Knowledge distillation and dedicated structures are not discussed here since they usually only work for specific models.

**Expected total reduction ratio: 30~120**
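For clarity, the total range above is just the product of the per-step ranges:

```python
# Per-step expected reduction ratios (low, high) from the plan above
pruning = (2, 3)    # Step 1: structured pruning
low_rank = (3, 5)   # Step 2: low-rank factorization
quant = (5, 8)      # Step 3: quantization

low = pruning[0] * low_rank[0] * quant[0]
high = pruning[1] * low_rank[1] * quant[1]
print(f"total reduction: {low}x ~ {high}x")  # 30x ~ 120x
```

This assumes the steps compose multiplicatively, which is optimistic: in practice each step eats into the redundancy the next one would exploit, so the realized total is often below the naive product.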

Weight pruning and Steps 2 and 3 can be done after training: I can literally get every weight from every neuron, transform it, and check the performance by running inference. Retraining may be needed, but that's not a big problem. I am discussing this here mainly to seek more advice, e.g. whether it will work for RNN/LSTM or, more specifically, end-to-end ASR.

Step 1 is more complicated since I have to do structured pruning while training. That means working through the whole training architecture and procedure, which is a task I don't really want to do. Fortunately, **pruning is built on TensorFlow**, so I would like to ask for some help here. That said, any advice and thoughts about my whole procedure, or about other compression methods, would be appreciated.

My little advice: **size means feasibility**. DeepSpeech should release some basic compression functionality (and if it already exists, at least publish a recipe/guide). This would give both the research team and users much more room to develop.

Thanks!