Deep Speech optimization in production

I was wondering: what are the general ways I can optimize for latency?
Can I use a tflite model for deepspeech inference in Python? I.e., if I import deepspeech, can I point the Model to the (latency-optimized) tflite file?
That way, we can use AWS Lambda for production, as the model size could drop from 190 MB (deepspeech 0.5) to around 10 MB.

Hey @lissyx @reuben, any suggestions? That would be really helpful. I understand we’re in different timezones, no issues. Take your time.

If you use the tflite version, you will obviously lose quality. A model that is just 5% of the size won’t give you as much as the full model.

Search the forum for inference and you’ll find that some people are already working on running inference in parallel on GPUs. My guess is that it’s not a good idea to move it to Lambda just yet.

If you understand that, why are you pinging people within 24h of posting the topic, in the middle of a weekend?

Please clarify, I don’t understand what you mean.

I don’t know where you saw a 10MB model.


To be more concrete, what is your latency target?

Yes, I’m okay with a little loss of quality versus a big gain in latency, and possibly a smaller model to load in AWS Lambda. I don’t want production inference to happen on GPU, since unless it’s batched, a GPU won’t give a huge advantage over an optimised CPU, right?

I’m really sorry; weekends aren’t a thing for me. Please accept my apologies.

I’m using the deepspeech utility to run inference in production, which goes like this:

import deepspeech as ds
......
model = ds.Model(model_path, N_FEATURES, N_CONTEXT, alphabet, BEAM_WIDTH)  # model_path is the exported graph, alphabet the alphabet.txt path

If I’m using a tflite model, can I use this functionality/utility?

Sorry about the number. The point was: my model is currently about 190 MB, and I want to reduce it as much as possible using tflite and pruning, then see what the reduced size is and how well the accuracy holds.

Right now it takes about 3.5 to 4 sec to decode a 6-7 second file on CPU in production, excluding network and other latencies. I want that reduced to the 1-1.5 sec range (again, excluding any other latency), probably with a lighter model. Is that possible?

Are you using the streaming interface?

If not, why did you decide against the streaming interface if latency is a priority?

Weekends may not be a thing for you, but they are for me. Reuben is on holidays for several weeks. So yeah, understand that this behavior can be seen as super rude.

As documented, just use the deepspeech_tflite Python wheel dependency instead of deepspeech. There is no need to change any of your code, not even the import.
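
For illustration, a minimal sketch of what that means in practice, assuming the tflite wheel for your release is published as deepspeech_tflite and that output_graph.tflite is whatever your export produced (both names are assumptions to check against the docs):

# pip install deepspeech_tflite      (instead of: pip install deepspeech)
import deepspeech as ds

model = ds.Model('output_graph.tflite', N_FEATURES, N_CONTEXT, alphabet, BEAM_WIDTH)  # same call as before, just pointed at the .tflite graph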

The TFLite model is 46 MB. Again, this makes no sense, I don’t get what you are talking about: do you mean you want to add tflite export? What “my model” are you referring to?

Have you read the documentation?

We already have a TFLite model and exports, why don’t you just use that?

That’s a 6x realtime target. Unless you use a GPU or a much faster CPU, I doubt you can achieve that with the default model.

Again, have you studied the complexity of the network? The LSTM layers make it mostly quadratic, so you can easily see how that fits with your usage pattern.

Without more context on what kind of use case you are thinking of, it’s really hard to help you.

Besides, nothing is hidden; the source code and model are available and documented. Everything you ask is something you could just read and evaluate yourself. Right now, I feel like you are asking me to do your job, and even pressuring me to do it during the weekend.

Please, there is an extensive amount of material and discussion here and in GitHub issues related to tflite, quantization, etc. Read it to get a grasp of everything that has already been tested on the network.

Sorry about that.

Thanks, I shall try that out

I’ve used transfer learning to train my model on a specific dialect. I’m not using tflite. I was thinking of exporting the current retrained model that I have to tflite and using it.

As I said, it’s a retrained model.

Yes, I was trying to figure out how that could be done.

I’m working on a voice interface for the healthcare sector. So, inference speed does matter there.

I was going through some of them.
Thanks a lot for your response.

Well, just export it to tflite format?

You are already twice as fast as realtime; is this really going to be your bottleneck? Do you have data supporting this?

How what could be done ?

Yes. I’ll see if DS 0.5.1 has deepspeech_tflite, and use that so I can keep the same functionality. I was also trying to figure out what TensorFlow optimizations you did while exporting.

I’ve spoken to a couple of doctors who asked for it to be “more real-time”. They were vague, but that was their requirement. Hence the need.

I was thinking about how this could be done with AWS EC2 instances. Probably by using AWS SageMaker Neo: if there is an optimized model, it could then be used with AWS Inferentia, which could speed things up, theoretically. If I get it done, I shall report back.

This is well documented and visible in the code … https://github.com/mozilla/DeepSpeech/blob/v0.5.1/DeepSpeech.py

search for export_tflite flag …
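
For illustration, an export invocation for 0.5.1 might look roughly like the line below; only the export_tflite flag is named above, the other flags are the usual DeepSpeech.py ones and should be double-checked against the linked source:

python DeepSpeech.py --checkpoint_dir /path/to/checkpoints --alphabet_config_path /path/to/alphabet.txt --export_dir /path/to/export --export_tflite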

Please be more specific. I don’t know what “more real-time” means.

Thanks. I shall check them.

I’m yet to get a good benchmark, hence I can’t be very specific, but they clearly felt that waiting 3 sec for a 6 sec file isn’t good. As the doctors are the clients, I’m trying to find out what can be done. However, I cannot quantify “more real-time” right now.

Thanks a lot for your quick responses.

Seriously, read the documentation. Using streaming, if you have 2x realtime, you are covered.
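
For reference, a minimal streaming sketch, assuming the 0.5.x-era Python API names (setupStream / feedAudioContent / finishStream; later releases renamed these, so check the API docs for your version) and a hypothetical audio_chunks iterable of 16 kHz, 16-bit mono PCM buffers; model_path and alphabet are placeholder paths:

import numpy as np
import deepspeech as ds

N_FEATURES, N_CONTEXT, BEAM_WIDTH = 26, 9, 500    # defaults used by the 0.5.x clients
model = ds.Model(model_path, N_FEATURES, N_CONTEXT, alphabet, BEAM_WIDTH)
sctx = model.setupStream()                        # one stream context for the whole utterance
for chunk in audio_chunks:                        # feed audio as it arrives
    model.feedAudioContent(sctx, np.frombuffer(chunk, np.int16))
text = model.finishStream(sctx)                   # single final decode, with context across all chunks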

I started a thread about 3 months back on some specific problems we’re facing with streaming. For example, if the sentence is
“I can hear you” and the two parts of the stream are “I can” and “hear you”, the decoded output is
“I can” and “here you”.
I haven’t worked on that part.

Well, I can’t comment out of context.

Why can’t the application simply use streaming? Concatenating two, or more, strings seems like an easy problem to solve.

For more details on streaming, see the blog post on the 0.6 release, in particular the “Consistent low latency” section. To give you a feel for the performance, it gives a representative example of a clip a few seconds long with a latency of 260 ms, which is basically instantaneous.

It is not about concatenation. Since the second part of the stream doesn’t have any reference to the first part, it results in decoding errors.
So if the real sentence is “I can hear you”, the streaming output (tried 3 months back with DeepSpeech 0.5.1) was
“I can” and “here you”. Note the difference between “hear” and “here”.
I haven’t been able to solve it, and hence didn’t bring this issue up again.
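
(Tying this back to the streaming sketch above: if both chunks belong to the same utterance, feeding them into a single stream context, rather than decoding each as its own stream, should give the decoder the acoustic and language-model context it needs across the chunk boundary.)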