DeepSpeech TensorFlow pruning

To serve the model in production, have you worked on pruning it? I was wondering if you've done anything like that, as it could really speed things up.
One additional question, not specific to DeepSpeech: is it wise to serve a DeepSpeech TFLite model in production via a server (so that it speeds things up)?

Can you elaborate, instead of asking fancy questions as if you have the answer and we made a mistake?

Everything we do is in the code, so just take a look.

That depends. Define "wise", define your problem, your constraints, and your expectations.

My aim is to use any optimization method to reduce model size and latency. With that in mind, I was wondering whether you've used, or plan to use, the TensorFlow pruning toolkit to eliminate some weak weights in the network. As far as the code goes, I haven't seen that, although I haven't worked with 0.6.x; I'm working with 0.5.

My expectation is to reduce model size by 60-65% and also reduce latency as much as possible. Can I meet these two objectives?
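To make the question concrete, this is the kind of magnitude pruning I had in mind, via the TensorFlow Model Optimization toolkit. It's only a sketch on a toy Keras model (layer sizes and step counts are placeholders), not something I've tried on the actual DeepSpeech graph:

```python
# Sketch of magnitude pruning with the TensorFlow Model Optimization toolkit.
# The DeepSpeech 0.5 graph is not a Keras model, so this only illustrates the
# idea; it is not a drop-in change to DeepSpeech.py.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

# Toy model standing in for the acoustic model (sizes are placeholders).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(2048, activation="relu", input_shape=(494,)),
    tf.keras.layers.Dense(29),
])

# Ramp sparsity up to 60% of weights over the fine-tuning steps.
pruning_params = {
    "pruning_schedule": tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.25,
        final_sparsity=0.60,
        begin_step=0,
        end_step=10000,
    )
}

pruned_model = prune_low_magnitude(model, **pruning_params)
pruned_model.compile(optimizer="adam", loss="mse")

# Fine-tune with the pruning callback, then strip the pruning wrappers:
# callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]
# pruned_model.fit(x_train, y_train, callbacks=callbacks, epochs=2)
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```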

Pruning will only remove some connections. It will have essentially zero impact on the network's complexity, which is the main factor in speed/latency here.

"reduce latency as much as possible" is too vague.

Reducing model size by 60-65% is easy: reduce n_hidden. But then you have to retrain.
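For context, that retrain is just a normal training run with a smaller width, roughly like this (dataset paths are placeholders and most flags are omitted):

```python
# Hypothetical sketch: retraining with a smaller n_hidden through the
# DeepSpeech.py training entry point. Dataset paths are placeholders.
import subprocess

subprocess.run([
    "python", "DeepSpeech.py",
    "--n_hidden", "1024",               # default is 2048; the hidden layers
                                        # scale roughly with n_hidden^2
    "--train_files", "train.csv",       # placeholder dataset paths
    "--dev_files", "dev.csv",
    "--test_files", "test.csv",
    "--checkpoint_dir", "ckpt_small/",  # fresh checkpoints: old ones won't load
], check=True)
```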

We have extensively tried TensorFlow-level quantization tooling and applied everything that works on our model.

The current big limitations are mostly TensorFlow-level limitations, e.g., the TFLite runtime does not leverage threads on our model. Going further in quantization breaks the model badly. Leveraging NNAPI or other TFLite delegates is not compatible with some major ops in the graph and slows things down by a factor of 10.
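For reference, the weight-only quantization path through the TFLite converter looks roughly like this; it's a generic sketch where the graph path and tensor names are placeholders, not the exact export code in the repo:

```python
# Post-training weight quantization through the TF 1.x TFLite converter.
# Paths and input/output names below are placeholders.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_frozen_graph(
    graph_def_file="output_graph.pb",            # placeholder frozen graph
    input_arrays=["input_node", "input_lengths"],  # placeholder input names
    output_arrays=["logits"],                      # placeholder output name
)

# Weight quantization only; pushing to full integer quantization is where
# the model starts to break.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()
with open("output_graph.tflite", "wb") as f:
    f.write(tflite_model)
```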

Okay.

I did transfer learning, so reducing n_hidden and retraining isn't a great option now.

Thanks a lot. On DeepSpeech 0.5.1, I'll see what post-transfer-learning quantization I can apply to the model. Any pointers/links would be helpful.

Thanks a lot again.

You should move to 0.6.1; there we have enabled as much quantization as possible. Again, you will find a lot of the experiments I ran on GitHub.
