Capacity needs for real-time speech-to-text

Hello!

I would like to ask your opinion on the following:

We have 10 ongoing phone calls (8000 Hz, mono, average call length 5 min) at all times. If we want to do near-real-time speech-to-text, what kind of capacity do we need? How many CPUs does it take to handle this load? Does a GPU make a big difference in decoding? Can DeepSpeech handle streaming data (WAVs)?

We are planning to build a real-time assistant and need to estimate the cost of running it (AWS, Google … ). The plan is:

  1. convert the phone call to 16 kHz mono
  2. stream the converted audio to the model …
  3. save the results to SQL … or something
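A minimal sketch of those three steps, assuming the DeepSpeech 0.9.x Python API (`Model`, `stt`), `sox` on the PATH, and SQLite standing in for "SQL … or something"; the `resample_to_16k` helper, file names, and table schema are placeholders:

```python
import sqlite3
import subprocess

import numpy as np
import deepspeech  # pip install deepspeech

MODEL_PATH = "deepspeech-0.9.3-models.pbmm"  # placeholder model file


def resample_to_16k(in_path: str) -> np.ndarray:
    """Step 1: use sox to turn 8 kHz mono telephony audio into 16 kHz,
    16-bit signed PCM, returned as an int16 numpy array."""
    raw = subprocess.check_output([
        "sox", in_path,
        "--type", "raw", "--rate", "16000", "--channels", "1",
        "--bits", "16", "--encoding", "signed-integer", "-",
    ])
    return np.frombuffer(raw, dtype=np.int16)


def transcribe_call(wav_path: str) -> None:
    model = deepspeech.Model(MODEL_PATH)

    audio = resample_to_16k(wav_path)   # step 1
    text = model.stt(audio)             # step 2 (whole-file, not yet streaming)

    # Step 3: a throwaway results table; a real system would key rows
    # by call id and chunk offset.
    db = sqlite3.connect("transcripts.db")
    db.execute("CREATE TABLE IF NOT EXISTS results (source TEXT, text TEXT)")
    db.execute("INSERT INTO results VALUES (?, ?)", (wav_path, text))
    db.commit()


if __name__ == "__main__":
    transcribe_call("call_8khz.wav")    # placeholder input
```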

All guesses are welcome, or even better, experience from this kind of case.

Thanks in advance!

DeepSpeech already converts audio to the correct format using SoX (see line 42 of native_client/python/client.py).
The memory-mapped model can save resources and can be generated easily with the tooling fetched by the taskcluster.py script.
As far as benchmarks go, I’ve run without a GPU on an ultrabook (i7 vPro chip) and inference time was ~0.6 seconds for samples ~5 seconds in length. This assumes the model is already loaded in memory behind a REST API.
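For a rough capacity estimate from that one data point (a back-of-the-envelope calculation, not a measurement of your workload):

```python
# Back-of-the-envelope sizing from the ~0.6 s per ~5 s benchmark above.
rtf = 0.6 / 5.0              # real-time factor on one CPU, no GPU
concurrent_calls = 10
cores_of_compute = concurrent_calls * rtf
print(f"RTF ~{rtf:.2f}, ~{cores_of_compute:.1f} cores of steady decode work")
# -> RTF ~0.12, ~1.2 cores; leave headroom for resampling, bursts,
#    and per-chunk latency (throughput alone isn't the whole story).
```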

Hope that helps.

Edit: I just saw you wanted streaming, sorry. Maybe the streaming examples in the DeepSpeech repo would help. I don’t think they convert on the fly.

Hello, and thanks for the quick reply.

I think the “streaming part” can be done a different way: split the 5 min of audio into 10-second pieces, run speech-to-text on each small WAV, write the results somewhere, then take the next 10-second piece … Referring to your numbers, each 10-second piece might take 1-2 seconds to process, which is fast enough.
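One way to realize that scheme without losing decoder context at chunk boundaries is DeepSpeech's streaming API; a sketch assuming the 0.9.x Python bindings (`createStream`, `feedAudioContent`, `intermediateDecode`), with the model path as a placeholder:

```python
import numpy as np
import deepspeech

SAMPLE_RATE = 16000    # after the 8 kHz -> 16 kHz conversion
CHUNK_SECONDS = 10     # the 10-second pieces proposed above

model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")  # placeholder path
stream = model.createStream()


def feed_piece(pcm16: np.ndarray) -> str:
    """Feed one 10 s piece of int16 PCM and return the running transcript,
    which can be written out after each piece."""
    stream.feedAudioContent(pcm16)
    return stream.intermediateDecode()


# When the call ends, flush the decoder for the final transcript:
# final_text = stream.finishStream()
```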

My devil’s advocate response is that you’d want to consider voice-activity-detection-based chunks instead of fixed-length chunks of audio.
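For illustration, here is what VAD-based chunking could look like with the `webrtcvad` package (my choice of library, not something from this thread); note it only accepts 10/20/30 ms frames of 16-bit mono PCM at 8/16/32/48 kHz:

```python
import webrtcvad  # pip install webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30                                      # webrtcvad: 10/20/30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit mono samples


def speech_segments(pcm: bytes, aggressiveness: int = 2):
    """Yield runs of consecutive voiced frames as byte strings, so chunks
    end at pauses instead of at arbitrary 10-second boundaries."""
    vad = webrtcvad.Vad(aggressiveness)
    voiced = bytearray()
    for off in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[off:off + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            voiced.extend(frame)
        elif voiced:
            yield bytes(voiced)
            voiced.clear()
    if voiced:
        yield bytes(voiced)
```

Each yielded segment can then be sent through `stt()` as above, so pieces end at pauses rather than mid-word.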
