Deep Speech optimization in production

Right now, it takes about 3.5 to 4 sec to decode a 6-7 second file on CPU, in production, excluding network and other latencies. I want to reduce that to the 1-1.5 sec range (excluding any other latency), probably with a lighter model. Is that possible?

Are you using the streaming interface?

If not, why did you decide against the streaming interface if latency is a priority?

Weekends may not be a thing for you, but they are for me. Reuben is on holidays for several weeks. So yeah, understand that this behavior can be seen as super rude.

As documented, just use the deepspeech_tflite Python wheel dependency instead of deepspeech. No need to change any of your code, not even the import.
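For illustration, here is a minimal sketch of what that swap looks like, assuming a 0.6-era Python API and a release that ships the TFLite wheel (package names and constructor arguments differ slightly across versions); the application code stays the same and only the model file changes:

```python
# Rough sketch, not verbatim from the docs: after installing the TFLite wheel
# (e.g. `pip install deepspeech-tflite` where available) instead of `deepspeech`,
# the code below is unchanged -- only the model file is the TFLite export.
import wave
import numpy as np
from deepspeech import Model   # same import as with the regular wheel

ds = Model("output_graph.tflite", 500)   # 0.6-era signature: (model_path, beam_width)

with wave.open("audio_16k_mono.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))
```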

The TFLite model is 46MB. Again, this makes no sense, I don’t get what you are talking about: do you mean you want to add TFLite export? What “my model” are you referring to?

Have you read the documentation?

We already have a TFLite model and exports, why don’t you just use that?

That’s roughly a 6x realtime target. Short of a GPU or a much faster CPU, I doubt you can achieve that with the default model.

Again, have you studied the complexity of the network? The LSTM layers make it mostly quadratic, so you can easily see how that fits your usage pattern.
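To spell that out with a rough back-of-the-envelope sketch of my own (assuming a single unidirectional LSTM layer with input size $i$, hidden size $h$, and $T$ audio frames, constants ignored):

$$
C_{\text{step}} \approx 4\,h\,(h+i)
\qquad\Rightarrow\qquad
C_{\text{total}} \approx 4\,T\,h\,(h+i) = O(T\,h^{2})
$$

Each timestep evaluates four gate matrix-vector products over the concatenated input and previous state, so the cost is quadratic in the hidden size and only linear in the audio length; a “lighter model” therefore mostly means shrinking $h$.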

Without more context on what kind of usecase you are thinking of, it’s really hard to help you.

Besides, nothing is hidden; the source code and model are available and documented. Everything you are asking about is something you could just read and evaluate yourself: right now, I feel like you are asking me to do your job, and even pressuring me to do it during the weekend.

Please, there is an extensive amount of material and discussion here and in GitHub issues related to TFLite, quantization, etc. Read them to get a grasp of everything that has already been tested on the network.

Sorry about that.

Thanks, I shall try that out.

I’ve used transfer learning to train my model on a specific dialect. I’m not using TFLite. I was thinking of exporting the current retrained model that I have to TFLite and using it.

As I said, it’s a retrained model.

Yes, I was trying to figure out how that could be done.

I’m working on a voice interface for the healthcare sector. So, inference speed does matter there.

I was going through some of them.
Thanks a lot for your response.

Well, just export it to TFLite format?

You are already twice as fast as realtime; is this really going to be your bottleneck? Do you have data supporting this?

How what could be done?

Yes. I’ll see if DS 0.5.1 has deepspeech_tflite; if so, I’ll use that so I can use that functionality. I was also trying to figure out what TensorFlow optimizations you applied while exporting.

I’ve spoken to a couple of doctors; they are asking for something more real-time. They were vague, but that was their requirement. Hence the need.

I was thinking about how this could be done with AWS EC2 instances. Probably by using AWS SageMaker Neo: if there is an optimized model that could then be used with AWS Inferentia, that could theoretically speed things up. If I get it done, I shall get back.

This is well documented and visible in the code … https://github.com/mozilla/DeepSpeech/blob/v0.5.1/DeepSpeech.py

Search for the export_tflite flag …

Please be more specific. I don’t know what “more real-time” means.

Thanks. I shall check them.

I’m yet to get a good benchmark, hence I can’t be very specific, but they clearly felt that waiting 3 sec for a 6 sec file isn’t good, and as the doctors are the clients, I’m trying to find out what could be done. However, I cannot quantify “more real-time” right now.

Thanks a lot for your quick responses.

Seriously, read the documentation. Using streaming, if you have 2x realtime, you are covered.
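As a rough illustration of the streaming flow (a sketch against the 0.6-era Python API, where the stream context is passed back to the model; newer releases expose the same operations on a Stream object, and `audio_chunks()` below is just a hypothetical stand-in for however you capture audio):

```python
# Sketch only: feed audio as it arrives so decoding overlaps with capture,
# and only the final flush adds latency on top of the audio itself.
import numpy as np
from deepspeech import Model

ds = Model("output_graph.pbmm", 500)      # 0.6-era signature: (model_path, beam_width)

ctx = ds.createStream()
for chunk in audio_chunks():              # hypothetical generator of int16 numpy buffers
    ds.feedAudioContent(ctx, chunk)
    # Optionally peek at a partial transcript while audio is still coming in:
    # print(ds.intermediateDecode(ctx))
print(ds.finishStream(ctx))               # final transcript
```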

I started a thread about 3 months back on some specific problems we’re facing with streaming. For example, if the sentence is:
“I can hear you” and the two parts of the stream are “I can” and “hear you”, the decoded output is
“I can” and “here you”.
I haven’t worked on that part.

Well, I can’t comment out of context.

Why can’t the application simply use streaming? Concatenating two, or more, strings seems like an easy problem to solve.

For more details on streaming, see the blog post on the 0.6 release, in particular the “Consistent low latency” section. To give you a feel for the performance, it gives a representative example of a clip that is a few seconds long being decoded with a latency of 260 ms, which is basically instantaneous.

It is not about concatenation. Since the second part of the stream doesn’t have a reference to the first part, that results in errors in decoding.
So if the real sentence is “I can hear you”, the streaming output was (tried 3 months back with DeepSpeech 0.5.1):
“I can” and “here you”. Note the difference between “hear” and “here”.
I haven’t been able to solve it and hence didn’t bring this issue up again.

So basically you are basing your assumptions on an older release, when we repeatedly mentioned that newer versions should improve.

The behavior you are seeing could be explained by so many implementation details, and yet you are focusing on solving the wrong issue.

The blog post I linked to, which I guess you didn’t look at, indicates that streaming has improved a lot with respect to latency since 0.5.1.

The streaming API gives the exact same output as feeding the entire file at once. If you’re seeing differences, it’s likely a bug in your code. For example, you could be dropping frames.
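One way to check that on your side (my own sketch, 0.6-era API assumed): decode the same 16 kHz mono clip once in a single call and once via streaming in fixed-size chunks; if the transcripts differ, the chunking/feeding code is the culprit.

```python
# Sanity-check sketch: batch decode vs. chunked streaming on the same clip.
import wave
import numpy as np
from deepspeech import Model

ds = Model("output_graph.pbmm", 500)

with wave.open("test_16k_mono.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

batch = ds.stt(audio)                      # whole file at once

ctx = ds.createStream()
for start in range(0, len(audio), 320):    # 20 ms chunks at 16 kHz
    ds.feedAudioContent(ctx, audio[start:start + 320])
streamed = ds.finishStream(ctx)

print(batch == streamed, batch, streamed)  # identical output expected
```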

Yes. The reason being that I’ll have to train with transfer learning on 0.6.x. I’m reading through all the 0.6.x files now, and thanks a lot, I’ll take this up now.

I’m looking at it now and will take this up and see how fast we can move to 0.6.x (as we have to retrain as well).

I’m looking at it. That could certainly be the case; I’ll analyse whether that is the real problem, but I’ll do that once we update to the newer version.

Just a small thing here: what’s the timeline for 0.7.x-stable?

Please rely on current master for transfer learning. The old transfer-learning2 branch is deprecated and dead; its features have been merged properly.

Okay, so I’ll use the master branch. That seems like a major change. Congrats on the merge.

Congrats go to @josh_meyer and @reuben. I just played the role of the painful guy constantly asking @josh_meyer to merge.