Using deepspeech-rs with GPU

I am seeing very long inference times on my desktop: about 200 ms for an OGG file that is less than 1 s long and has no content, and at least 6 seconds for a regular 20 s OGG file.

I’m using the prebuilt model on Rust with the deepspeech-rs binding.

I want to use my GPU. I’m using the native_client.amd64.cuda.linuxmodel already. What else can I do?

Following the build instructions of the Rust crate should be enough. If you use the CUDA-enabled native client (the one you linked), it should work transparently. cc @est31

Thanks for your response… hmm. Can I somehow check whether my project actually uses the GPU? I’m not sure, because the inference times remain the same.

When running on GPU, TensorFlow should output a lot of information on stdout/stderr. Have you properly set up the system so that the Rust crate uses the CUDA version?
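One way to check this from inside the program, besides watching the TensorFlow log output, is to look at which shared libraries the process has mapped. The sketch below is only a heuristic and rests on assumptions: it is Linux-only (it reads /proc/self/maps), it assumes the CUDA build pulls in libraries whose file names start with "libcuda" or "libcublas", and the function name maps_mention_cuda is made up for illustration. It should be called after the model has been loaded, since the libraries may be loaded lazily.

```rust
use std::fs;

/// Heuristic: returns true if any mapped file looks like a CUDA library.
/// The name prefixes checked here are an assumption, not an official list.
fn maps_mention_cuda(maps: &str) -> bool {
    maps.lines().any(|line| {
        // The file name is whatever follows the last '/' on the line.
        line.rsplit('/')
            .next()
            .map_or(false, |name| {
                name.starts_with("libcuda") || name.starts_with("libcublas")
            })
    })
}

fn main() {
    // On non-Linux systems the file does not exist and we fall back to "".
    let maps = fs::read_to_string("/proc/self/maps").unwrap_or_default();
    println!("CUDA libraries mapped: {}", maps_mention_cuda(&maps));
}
```

If this prints false after the model is loaded, the process is most likely still using the CPU-only library.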

Content or no content, the system has to analyze it … 200 ms for 1 s of audio is 5x faster than real time, so I don’t think it qualifies as “long inference time”. Can you clarify your expectations?

Same here: 6 s of inference for a 20 s audio file is much faster than real time. (OGG is not supported, so either you or the crate converts to WAV at some point.)

FTR: I’m confident it works because I have code doing that …

You know it works because you are also using deepspeech-rs?

TensorFlow doesn’t output a lot of information for me. Just that the model is loaded (I don’t know the specific message off the top of my head).

I set up the Rust crate as described in its README and then switched the native client to the one specified. Can I do more? deepspeech-rs only wraps the DeepSpeech API, so the code should not need to change.

How is 6 s of inference for a 20 s audio file faster than real time? Only if you “stream” the inference, like Google Translate does? Is that possible with DeepSpeech?

I would run this on a server. With those inference times, how many calls per second can you answer? 2?

Thanks for your response.


But we don’t have anything really efficient yet.

Real-time means we process audio faster than it “arrives”.
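To make that definition concrete, here is a small sketch of the arithmetic using the numbers from this thread. The function names are made up for illustration, and the throughput figure assumes a single worker that handles one clip at a time with no batching.

```rust
/// Real-time factor: processing time divided by audio duration.
/// A value below 1.0 means inference finishes faster than the audio "arrives".
fn real_time_factor(inference_secs: f64, audio_secs: f64) -> f64 {
    inference_secs / audio_secs
}

/// Rough upper bound on calls per second for one worker that
/// processes clips strictly one after another (no batching).
fn sequential_calls_per_sec(inference_secs: f64) -> f64 {
    1.0 / inference_secs
}

fn main() {
    // 6 s of inference for a 20 s file, as reported above.
    let rtf = real_time_factor(6.0, 20.0);
    println!("real-time factor: {:.2}", rtf);
    println!("faster than real time: {}", rtf < 1.0);
    println!("calls per second: {:.2}", sequential_calls_per_sec(6.0));
}
```

So the 20 s file is processed faster than real time, but a single sequential worker still answers well under one such call per second.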

Again, just use the CUDA version instead of the default one. You can see that in the Dockerfile of the project above.

Basically, what you need is batching, and this requires a lot of work that has been started but is far from complete. Cc @kdavis

Thank you for the input. I didn’t copy that file for some reason. Stupid mistake… :hot_face:

So, is it working now?

Yes, with the GPU I can reduce inference time for the 20 s file from ~6.7 s to ~2.8 s. Thanks again for the help, chief.

I use a Ryzen 3700X and a GeForce RTX 2070 Super.

