Deep Speech optimization in production

Right now, it takes about 3.5 to 4 sec to decode a 6-7 second file on CPU, in production, excluding network and other latencies. I want to reduce that to the 1-1.5 sec range (excluding any other latency), probably with a lighter model. Is that possible?

Are you using the streaming interface?

If not, why did you decide against the streaming interface if latency is a priority?

Weekends may not be a thing for you, but they are for me. Reuben is on holidays for several weeks. So yeah, understand that this behavior can be seen as super rude.

As documented, just use the deepspeech_tflite Python wheel dependency instead of deepspeech. No need to change any of your code, not even the import.
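For illustration, here is a minimal sketch of what that swap looks like, assuming a 0.6-era Python API and a release that ships the TFLite wheel (package names and constructor arguments differ slightly across versions); the application code stays the same and only the model file changes:

```python
# Rough sketch, not verbatim from the docs: after installing the TFLite wheel
# (e.g. `pip install deepspeech-tflite` where available) instead of `deepspeech`,
# the code below is unchanged -- only the model file is the TFLite export.
import wave
import numpy as np
from deepspeech import Model   # same import as with the regular wheel

ds = Model("output_graph.tflite", 500)   # 0.6-era signature: (model_path, beam_width)

with wave.open("audio_16k_mono.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))
```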

The TFLite model is 46MB. Again, this makes no sense, I don’t get what you are talking about: do you mean you want to add TFLite export? What “my model” are you referring to?

Have you read the documentation?

We already have a TFLite model and exports, why don’t you just use that?

That’s roughly a 6x realtime target. Short of a GPU or a much faster CPU, I doubt you can achieve that with the default model.

Again, have you studied the complexity of the network? The LSTM layers make it mostly quadratic, so you can easily see how that fits your usage pattern.
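To spell that out with a rough back-of-the-envelope sketch of my own (assuming a single unidirectional LSTM layer with input size $i$, hidden size $h$, and $T$ audio frames, constants ignored):

$$
C_{\text{step}} \approx 4\,h\,(h+i)
\qquad\Rightarrow\qquad
C_{\text{total}} \approx 4\,T\,h\,(h+i) = O(T\,h^{2})
$$

Each timestep evaluates four gate matrix-vector products over the concatenated input and previous state, so the cost is quadratic in the hidden size and only linear in the audio length; a “lighter model” therefore mostly means shrinking $h$.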

Without more context on what kind of usecase you are thinking of, it’s really hard to help you.

Besides, nothing is hidden; the source code and model are available and documented. Everything you are asking about is something you could just read and evaluate yourself: right now, I feel like you are asking me to do your job, and even pressuring me to do it during the weekend.

Please, there is an extensive amount of material and discussion here and in GitHub issues related to TFLite, quantization, etc. Read them to get a grasp of everything that has already been tested on the network.

Sorry about that.

Thanks, I shall try that out.

I’ve used transfer learning to train my model on a specific dialect. I’m not using TFLite. I was thinking of exporting the current retrained model that I have to TFLite and using it.

As I said, it’s a retrained model.

Yes, I was trying to figure out how that could be done.

I’m working on a voice interface for the healthcare sector. So, inference speed does matter there.

I was going through some of them.
Thanks a lot for your response.

Well, just export it to TFLite format?

You are already twice as fast as realtime; is this really going to be your bottleneck? Do you have data supporting this?

How what could be done?

Yes. I’ll see if DS 0.5.1 has deepspeech_tflite; if so, I’ll use that so I can use that functionality. I was also trying to figure out what TensorFlow optimizations you applied while exporting.

I’ve spoken to a couple of doctors; they are asking for something more real-time. They were vague, but that was their requirement. Hence the need.

I was thinking about how this could be done with AWS EC2 instances. Probably by using AWS SageMaker Neo: if there is an optimized model that could then be used with AWS Inferentia, that could theoretically speed things up. If I get it done, I shall get back.

This is well documented and visible in the code … https://github.com/mozilla/DeepSpeech/blob/v0.5.1/DeepSpeech.py

Search for the export_tflite flag …

Please be more specific. I don’t know what “more real-time” means.

Thanks. I shall check them.

I’m yet to get a good benchmark, hence I can’t be very specific, but they clearly felt that waiting 3 sec for a 6 sec file isn’t good, and as the doctors are the clients, I’m trying to find out what could be done. However, I cannot quantify “more real-time” right now.

Thanks a lot for your quick responses.

Seriously, read the documentation. Using streaming, if you have 2x realtime, you are covered.
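As a rough illustration of the streaming flow (a sketch against the 0.6-era Python API, where the stream context is passed back to the model; newer releases expose the same operations on a Stream object, and `audio_chunks()` below is just a hypothetical stand-in for however you capture audio):

```python
# Sketch only: feed audio as it arrives so decoding overlaps with capture,
# and only the final flush adds latency on top of the audio itself.
import numpy as np
from deepspeech import Model

ds = Model("output_graph.pbmm", 500)      # 0.6-era signature: (model_path, beam_width)

ctx = ds.createStream()
for chunk in audio_chunks():              # hypothetical generator of int16 numpy buffers
    ds.feedAudioContent(ctx, chunk)
    # Optionally peek at a partial transcript while audio is still coming in:
    # print(ds.intermediateDecode(ctx))
print(ds.finishStream(ctx))               # final transcript
```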

I started a thread about 3 months back on some specific problems we’re facing with streaming. For example, if the sentence is:
“I can hear you” and the two parts of the stream are “I can” and “hear you”, the decoded output is
“I can” and “here you”.
I haven’t worked on that part.

Well, I can’t comment out of context.

Why can’t the application simply use streaming? Concatenating two, or more, strings seems like an easy problem to solve.

For more details on streaming, see the blog post on the 0.6 release, in particular the “Consistent low latency” section. To give you a feel for the performance, it gives a representative example of a clip that is a few seconds long being decoded with a latency of 260 ms, which is basically instantaneous.

It is not about concatenation. Since the second part of the stream doesn’t have a reference to the first part, that results in errors in decoding.
So if the real sentence is “I can hear you”, the streaming output was (tried 3 months back with DeepSpeech 0.5.1):
“I can” and “here you”. Note the difference between “hear” and “here”.
I haven’t been able to solve it and hence didn’t bring this issue up again.

So basically you are basing your assumptions on an older release, when we repeatedly mentioned that newer versions should improve.

The behavior you are seeing could be explained by so many implementation details, and yet you are focusing on solving the wrong issue.

The blog post I linked to, which I guess you didn’t look at, indicates that streaming has improved a lot with respect to latency since 0.5.1.

The streaming API gives the exact same output as feeding the entire file at once. If you’re seeing differences, it’s likely a bug in your code. For example, you could be dropping frames.
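One way to check that on your side (my own sketch, 0.6-era API assumed): decode the same 16 kHz mono clip once in a single call and once via streaming in fixed-size chunks; if the transcripts differ, the chunking/feeding code is the culprit.

```python
# Sanity-check sketch: batch decode vs. chunked streaming on the same clip.
import wave
import numpy as np
from deepspeech import Model

ds = Model("output_graph.pbmm", 500)

with wave.open("test_16k_mono.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

batch = ds.stt(audio)                      # whole file at once

ctx = ds.createStream()
for start in range(0, len(audio), 320):    # 20 ms chunks at 16 kHz
    ds.feedAudioContent(ctx, audio[start:start + 320])
streamed = ds.finishStream(ctx)

print(batch == streamed, batch, streamed)  # identical output expected
```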

Yes. The reason being that I’ll have to train with transfer learning on 0.6.x. I’m reading through all the 0.6.x files now, and thanks a lot, I’ll take this up now.

I’m looking at it now and will take this up and see how fast we can move to 0.6.x (as we have to retrain as well).

I’m looking at it. That could certainly be the case; I’ll analyse whether that is the real problem, but I’ll do that once we update to the newer version.

Just a small thing here: what’s the timeline for 0.7.x-stable?

Please rely on current master for transfer learning. The old transfer-learning2 branch is deprecated and dead; its features have been merged properly.

Okay, so I’ll use the master branch. That seems like a major change. Congrats on the merge.

Congrats go to @josh_meyer and @reuben. I just played the role of the painful guy constantly asking @josh_meyer to merge.