Online Decoding (since v2)

Now that DeepSpeech uses a unidirectional RNN (LSTM) instead of the BiRNN it used before, is there any API/endpoint that can be used for online decoding, especially in its latest stable release, v0.4.1?

We have the streaming API; is that what you need? If not, please explain.

Thank you. Sorry for the delay (time zone, sigh). I’ll go through the documentation again; I haven’t found the streaming API yet. If I find it, I’ll link it here; otherwise, @lissyx, could you provide a link to the streaming API documentation?

Thank you.

@lissyx, are you referring to this:
https://hacks.mozilla.org/2018/09/speech-recognition-deepspeech/

and specifically this part:

Here’s a small Python program that demonstrates how to use libSoX to record from the microphone and feed it into the engine as the audio is being recorded

Does this do real-time transcription? Right now the transcription only appears after the end of speech. Is it possible to transcribe in real time, so the model transcribes as the user speaks (and not only once they have finished)? The issue seems to be with this line:

model.finishStream(sctx)

The transcription is only provided after this line; can we not get continuous transcription?

Going through the code, I found a method:

model.intermediateDecode(sctx)

Is this the only API? It seems to keep emitting the previous data even after the speaker has stopped speaking.
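For reference, here is a minimal sketch of how these streaming calls fit together in the 0.4.1 Python bindings (the method names follow the blog post linked above; the model and audio paths are placeholders, and the constructor arguments are just the defaults used by the release client):

import wave
import numpy as np
from deepspeech import Model

# 26 MFCC features, 9 frames of context, beam width 500: defaults from the 0.4.1 client
model = Model('output_graph.pbmm', 26, 9, 'alphabet.txt', 500)
sctx = model.setupStream()

with wave.open('speech.wav', 'rb') as wav:  # assumed 16 kHz, 16-bit mono
    while True:
        chunk = wav.readframes(512)
        if not chunk:
            break
        model.feedAudioContent(sctx, np.frombuffer(chunk, np.int16))
        print('partial:', model.intermediateDecode(sctx))  # partial transcript so far

print('final:', model.finishStream(sctx))  # closes the stream and returns the final transcript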

Yes and no; we still lack the streaming decoder, and there’s a GitHub issue open for it. However, using VAD you can already do something interesting. An example exists in mozillaspeechlibrary: https://github.com/mozilla/androidspeech/blob/master/mozillaspeechlibrary/src/main/java/com/mozilla/speechlibrary/LocalSpeechRecognition.java

Hey @lissyx, thanks a lot. I am already using VAD to detect the start and end of speech, for a good user experience (the user needn’t press a stop button). I shall go through the code and see what it says. Thanks a lot. In the meantime, the method:

model.intermediateDecode(sctx)

works well, and using VAD and this method together, along with some logic to avoid redundancy (the method keeps emitting the same output), I think it could be handled pretty nicely.

Could you just summarise what should be kept in mind while decoding a stream (i.e. what are the building blocks, other than VAD)?

I don’t get your question.

Alright, the question has two question parts and one comment part.

Comment part: I’m using VAD for speech start and end detection in the frontend.

Q1: I’m planning the following; do you think this is a viable idea? Stream audio, get the intermediate output using model.intermediateDecode, then detect end of speech with VAD and call model.finishStream. Use regex and string functions to refine the streaming output of model.intermediateDecode. Is this a viable plan, or is there a glaring caveat I’m missing?
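For the “refine the streaming output” part, one simple illustration (just a sketch, not anything from the library): only forward the part of the partial transcript that is new compared with the previous call, falling back to the full string when the decoder has revised earlier words.

def new_suffix(prev, current):
    """Return what `current` adds on top of `prev`.

    intermediateDecode may revise earlier words between calls, in which
    case the whole current transcript is returned instead of a suffix.
    """
    if current.startswith(prev):
        return current[len(prev):]
    return current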

Q2: For streaming decoding to take place, what are the building blocks that need to be kept in mind, other than VAD (can you direct me to some theory)?

Please refer to the GitHub issue.

I still don’t understand what you want to do.

Ok, thanks, will do.

Let me try it one more time:

  1. Start streaming.
  2. Get the intermediate output using model.intermediateDecode (O1).
  3. Using a function, check whether O1prev == O1current; if they are the same, ignore it, else send the updated O1 to the frontend (I’m using websockets).
  4. Using VAD, detect end of speech (EOS).
  5. After detecting EOS, run model.finishStream (O2).
  6. Send the final output (O2) to the user and on for further NLP.

This code:

import numpy as np

# model, sctx and subproc are set up as in the blog post example above
# (sctx = model.setupStream(), subproc records from the microphone via sox)
output_prev = ""
while True:
    data = subproc.stdout.read(512)
    model.feedAudioContent(sctx, np.frombuffer(data, np.int16))
    output = model.intermediateDecode(sctx)
    if output != output_prev:
        # only forward the transcript when it has changed
        print(output)
        output_prev = output

(Instead of print, send it to the frontend.)
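For steps 4 to 6, here is a rough sketch of how end-of-speech detection could be added to the same loop. It assumes 16 kHz, 16-bit mono audio, uses the third-party webrtcvad package purely as an illustration (that choice is mine, not from this thread), and takes model, sctx and subproc as set up above. Reading 640 bytes at a time makes each chunk a valid 20 ms VAD frame:

import numpy as np
import webrtcvad  # assumption: pip install webrtcvad

vad = webrtcvad.Vad(2)       # aggressiveness 0-3
FRAME_BYTES = 640            # 20 ms of 16 kHz, 16-bit mono audio
MAX_SILENCE_FRAMES = 25      # ~0.5 s of silence ends the utterance

output_prev = ""
silence = 0
while True:
    data = subproc.stdout.read(FRAME_BYTES)
    if len(data) < FRAME_BYTES:
        break
    model.feedAudioContent(sctx, np.frombuffer(data, np.int16))

    # step 3: forward a partial transcript only when it changes
    output = model.intermediateDecode(sctx)
    if output != output_prev:
        print(output)        # or push it over the websocket
        output_prev = output

    # steps 4-5: count consecutive non-speech frames to detect end of speech
    silence = 0 if vad.is_speech(data, 16000) else silence + 1
    if silence >= MAX_SILENCE_FRAMES:
        break

# step 6: close the stream and get the final transcript
print(model.finishStream(sctx))

Note that in 0.4.x intermediateDecode re-runs the decoder over everything accumulated so far, so calling it on every 20 ms frame gets expensive; in practice you would probably only call it every N frames.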

Why wouldn’t it work? But I can’t give any more hints; you need to do your own homework.

Sure… It’s working and thanks… I’ll take it up from here…