NVIDIA gets 3500x realtime ASR on Kaldi

I’m intrigued to know what they optimized to achieve these results, although I’m not sure how much would be directly applicable to DeepSpeech. But I thought it was interesting that they made optimizations to the model at training time to make it work faster on GPUs later on during inference.

Well, their patch is public: https://github.com/kaldi-asr/kaldi/pull/3114/files

A single speech-to-text stream is not enough work to fully saturate an NVIDIA GPU. To fully saturate the GPU we need to decode many audio files concurrently. The solution provided does this through a combination of batching many audio files into a single speech pipeline, running multiple pipelines in parallel on the device, and using multiple CPU threads to perform feature extraction and determinization. Users of the decoder will need a high-level understanding of the underlying implementation to know how to tune the decoder.
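
To make that description concrete, here is a minimal Python sketch of the pattern, not NVIDIA's actual Kaldi code: `extract_features` and `decode_batch_on_gpu` are hypothetical placeholders for the real Kaldi stages, and the batch size, pipeline count, and CPU thread count are made-up numbers you would tune for your own hardware.

```python
# Sketch only: batch utterances per GPU call, run a few pipelines in parallel,
# and keep feature extraction on CPU threads, as the PR description outlines.
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def extract_features(path):
    # Placeholder: in Kaldi this would be CPU-side feature extraction (e.g. MFCCs).
    return {"utt": path, "feats": b"..."}


def decode_batch_on_gpu(feature_batch, pipeline_id):
    # Placeholder: in Kaldi this is the batched acoustic model + CUDA decoder.
    return [(f["utt"], "<transcript>") for f in feature_batch]


def chunked(iterable, size):
    # Yield successive fixed-size batches from an iterable.
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def transcribe_all(audio_paths, batch_size=64, n_pipelines=2, n_cpu_threads=8):
    results = []
    with ThreadPoolExecutor(max_workers=n_cpu_threads) as cpu_pool, \
         ThreadPoolExecutor(max_workers=n_pipelines) as gpu_pool:
        # CPU threads extract features while already-submitted batches decode.
        features = cpu_pool.map(extract_features, audio_paths)
        futures = [
            gpu_pool.submit(decode_batch_on_gpu, batch, i % n_pipelines)
            for i, batch in enumerate(chunked(features, batch_size))
        ]
        for fut in futures:
            results.extend(fut.result())
    return results


if __name__ == "__main__":
    print(len(transcribe_all([f"utt_{i}.wav" for i in range(200)])))
```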

So basically it also targets a specific use case with a lot of processing happening at the same time.

Oh, that’s a lot less impressive then.

Well, it's still impressive and interesting to know you can improve things like that, though I'm unsure about the use case in the wild.

A couple of use cases come to mind:

  • Client-server, on-line STT, batching several requests “near” each other in time to maximize GPU utilization (Baidu did this several years ago for their on-line STT system); a rough sketch of this pattern follows the list
  • An STT service that transcribes large quantities of audio, e.g. transcribing all the audio of a national library.
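
For the first use case, the standard pattern is dynamic (time-window) micro-batching: hold each incoming request for a few milliseconds, group whatever arrived in that window, and send the group to the GPU as one batch. A rough Python sketch, where `decode_batch` stands in for whichever batched decoder you actually use and the 20 ms window is an arbitrary number to tune against your latency budget:

```python
# Rough sketch of time-window micro-batching for an on-line STT server.
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def decode_batch(audio_buffers):
    # Placeholder for a real batched GPU decode call.
    return [f"<transcript {i}>" for i in range(len(audio_buffers))]


class MicroBatcher:
    """Groups requests that arrive within a short window into one GPU batch."""

    def __init__(self, window_ms=20, max_batch=32):
        self.window = window_ms / 1000.0
        self.max_batch = max_batch
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, audio):
        # Called by each client handler; blocks until its transcript is ready.
        done = threading.Event()
        slot = {"audio": audio, "done": done, "result": None}
        self.requests.put(slot)
        done.wait()
        return slot["result"]

    def _loop(self):
        while True:
            # Take one request, then collect whatever else shows up in the window.
            batch = [self.requests.get()]
            deadline = time.monotonic() + self.window
            while len(batch) < self.max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break
            transcripts = decode_batch([slot["audio"] for slot in batch])
            for slot, text in zip(batch, transcripts):
                slot["result"] = text
                slot["done"].set()


if __name__ == "__main__":
    batcher = MicroBatcher()
    with ThreadPoolExecutor(max_workers=8) as pool:
        # Simulate several near-simultaneous client requests.
        print(list(pool.map(batcher.submit, [b"pcm" for _ in range(8)])))
```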

That kind of concurrent operation is hard with DeepSpeech because it takes up so much RAM. I'm assuming that's because the model is loaded into memory in its entirety. Presumably that will improve once compression and/or the TFLite model become available.

@dabinat You can export the TFLite model using master (see the instructions here), and it's about 46 MB. So unless you're on some incredibly small-footprint device, memory is not a problem.
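
In case it's useful, here's a minimal sketch of loading an exported model with the deepspeech Python package; the constructor arguments have changed between releases and the file names below are made up, so check the docs for the version you install:

```python
# Illustrative only: the deepspeech package API has varied across releases,
# and the model/audio file names here are hypothetical.
import wave

import numpy as np
from deepspeech import Model

ds = Model("output_graph.tflite")  # newer releases take just the model path

with wave.open("audio_16khz_mono.wav", "rb") as wav:
    # DeepSpeech expects 16 kHz, 16-bit mono PCM.
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(ds.stt(audio))  # transcript as a plain string
```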