I am experiencing very long inference times on my desktop, e.g. 200 ms for an OGG file that is less than 1 s long and contains no speech. For a regular 20 s OGG file it takes at least 6 seconds.
I’m using the prebuilt model in Rust with the deepspeech-rs binding.
I want to use my GPU. I’m already using the native_client.amd64.cuda.linux model. What else can I do?
lissyx
Following the build instructions of the Rust crate should be enough; if you use the CUDA-enabled libdeepspeech.so (the one you linked), it should work transparently. cc @est31
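For reference, here is a minimal sketch of what a deepspeech-rs call looks like; the point is that this code is identical for the CPU and the CUDA builds, only the libdeepspeech.so you link against changes. The Model::load_from_files signature has changed across crate releases (older ones also take extra arguments such as an alphabet path and beam width), and the model path is a placeholder, so check your crate version:

```rust
// Minimal sketch of a deepspeech-rs call, assuming roughly the newer API
// where Model::load_from_files takes a single model path. The model file
// name is a placeholder; point it at the graph shipped with the native
// client you downloaded.
use std::path::Path;

use deepspeech::Model;

fn transcribe(samples: &[i16]) -> String {
    let mut model = Model::load_from_files(Path::new("output_graph.pbmm"))
        .expect("failed to load model");

    // `samples` must be 16 kHz mono signed 16-bit PCM.
    model
        .speech_to_text(samples)
        .expect("inference failed")
}
```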
Thanks for your response… hmm. Can I somehow check, if my project indeed uses the GPU? I’m not sure, because the inference times remain the same.
lissyx
When running on a GPU, TensorFlow should print a lot of information to stdout/stderr. Have you set up the system properly so that the Rust crate uses the CUDA version?
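One way to check is to make sure TensorFlow's native logging is not being filtered and then watch stderr while the model loads; when a GPU is actually used, TensorFlow prints device-creation messages (the exact wording varies by version). A minimal sketch, reusing the placeholder model path from above:

```rust
// Sketch: make sure TensorFlow's C++ logging is not suppressed before the
// model is loaded, then watch stderr for GPU device-creation messages
// (wording varies by TensorFlow version, typically something like
// "Created TensorFlow device ... -> physical GPU").
use std::path::Path;

use deepspeech::Model;

fn main() {
    // 0 = show everything (INFO and up); must be set before libdeepspeech
    // initializes TensorFlow.
    std::env::set_var("TF_CPP_MIN_LOG_LEVEL", "0");

    let _model = Model::load_from_files(Path::new("output_graph.pbmm"))
        .expect("failed to load model");
    // If nothing on stderr mentions a GPU device at this point, inference is
    // almost certainly running on the CPU.
}
```

Running nvidia-smi in a second terminal while inference is in progress is another quick check: the process should show up there with GPU memory allocated.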
lissyx
Content or no content, the system still has to analyze the audio. 200 ms for 1 s of audio is 5× real time; I don’t think that qualifies as a “long inference time”. Can you clarify your expectations?
Likewise, 6 s of inference for a 20 s audio file (OGG is not supported, so you or the crate must be converting it to WAV at some point) is much faster than real time.
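To make the real-time arithmetic concrete, here is a sketch that times one transcription and divides by the audio duration. It assumes the OGG has already been converted to 16 kHz mono 16-bit WAV outside the program (e.g. with sox or ffmpeg) and uses the hound crate purely to read the WAV; neither the conversion step nor hound is part of DeepSpeech or deepspeech-rs:

```rust
// Sketch: measure the real-time factor of a single transcription.
use std::path::Path;
use std::time::Instant;

use deepspeech::Model;
use hound::WavReader;

fn main() {
    let mut reader = WavReader::open("clip.wav").expect("failed to open WAV");
    let spec = reader.spec();
    assert_eq!(spec.sample_rate, 16_000, "model expects 16 kHz audio");
    assert_eq!(spec.channels, 1, "model expects mono audio");

    let samples: Vec<i16> = reader
        .samples::<i16>()
        .map(|s| s.expect("bad sample"))
        .collect();
    let audio_secs = samples.len() as f64 / spec.sample_rate as f64;

    let mut model = Model::load_from_files(Path::new("output_graph.pbmm"))
        .expect("failed to load model");

    let start = Instant::now();
    let text = model.speech_to_text(&samples).expect("inference failed");
    let inference_secs = start.elapsed().as_secs_f64();

    // 6 s of inference for 20 s of audio gives a real-time factor of 0.3:
    // the engine processes audio roughly 3.3x faster than it is spoken.
    println!("{}", text);
    println!("real-time factor: {:.2}", inference_secs / audio_secs);
}
```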
lissyx
FTR: I’m confident it works because I have code doing that …
You know it works because you are also using deepspeech-rs?
TensorFlow doesn’t output a lot of information for me. Just that the model is loaded (I don’t remember the exact message off the top of my head).
I set up the Rust crate as described in its README and then switched the native client to the one specified above. Can I do more? deepspeech-rs only wraps the DeepSpeech API, so my code should not need to change.
How is 6 s of inference for a 20 s audio file faster than real time? Only if you “stream” the inference, like Google Translate does? Is that possible with DeepSpeech?
I would run this on a server. With those inference times, how many calls per second could you handle? Two?
Thanks for your response.
lissyx
Yes.
But we don’t have anything really efficient yet.
Real time means we process the audio faster than it “arrives”.
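For context, the streaming flow mirrors the engine's C API (CreateStream / FeedAudioContent / FinishStream). The Rust method names below (create_stream, feed_audio, finish) follow what newer deepspeech-rs releases expose; older crate versions may name things differently or not expose streaming at all, so treat this as illustrative:

```rust
// Illustrative streaming sketch: feed audio as it "arrives" instead of
// waiting for the whole file, then finalize the decode at the end.
use std::path::Path;

use deepspeech::Model;

fn transcribe_in_chunks(chunks: &[Vec<i16>]) -> String {
    let mut model = Model::load_from_files(Path::new("output_graph.pbmm"))
        .expect("failed to load model");

    let mut stream = model.create_stream().expect("failed to create stream");
    for chunk in chunks {
        // Each chunk is 16 kHz mono i16 PCM, e.g. from a microphone or socket.
        stream.feed_audio(chunk);
    }
    // Decoding state is finalized once the stream is finished.
    stream.finish().expect("decoding failed")
}
```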
lissyx
Again, just use the CUDA libdeepspeech.so instead of the default one. You can see that in the Dockerfile of the project above.
lissyx
Basically what you need is batching, and that requires a lot of work; it has been started but is far from complete. cc @kdavis
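Until batching is available, one workaround on a server (not something DeepSpeech itself provides) is to serialize all requests through a single worker thread that owns the model, so the GPU stays loaded without concurrent sessions competing for it. A rough sketch; the Request type and channel wiring are made up for illustration:

```rust
// Sketch of a stopgap while proper batching is not available: one worker
// thread owns the Model and processes requests one at a time; callers send
// audio over a channel and receive the transcript back.
use std::path::Path;
use std::sync::mpsc;
use std::thread;

use deepspeech::Model;

struct Request {
    samples: Vec<i16>,
    reply: mpsc::Sender<String>,
}

fn spawn_worker() -> mpsc::Sender<Request> {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        // Load the model once; it stays resident for all requests.
        let mut model = Model::load_from_files(Path::new("output_graph.pbmm"))
            .expect("failed to load model");
        for req in rx {
            let text = model.speech_to_text(&req.samples).unwrap_or_default();
            // Ignore send errors: the caller may have given up waiting.
            let _ = req.reply.send(text);
        }
    });
    tx
}

fn main() {
    let worker = spawn_worker();

    // One illustrative request (1 s of silence); a real server would build
    // these from incoming connections.
    let (reply_tx, reply_rx) = mpsc::channel();
    worker
        .send(Request { samples: vec![0i16; 16_000], reply: reply_tx })
        .expect("worker gone");
    println!("{}", reply_rx.recv().expect("no reply"));
}
```

Throughput is still bounded by the single-stream real-time factor, so this only smooths out contention; true batching is what would raise the calls-per-second number.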