Online Decoding (since v2)

Now that DeepSpeech uses a unidirectional RNN (LSTM) instead of the BiRNN it used before, is there any API/endpoint that can be used for online decoding, especially in its latest stable release, v0.4.1?

We have the streaming API; is that what you need? If not, please explain.

Thank you. Sorry for the delay (time zone, sigh). I’ll go through the documentation again; I haven’t found the streaming API yet. If I find it, I’ll link it here; otherwise, @lissyx, could you provide a link to the streaming API documentation?

Thank you.

@lissyx, are you referring to this:
https://hacks.mozilla.org/2018/09/speech-recognition-deepspeech/

and specifically this part:

Here’s a small Python program that demonstrates how to use libSoX to record from the microphone and feed it into the engine as the audio is being recorded

Does this do real-time transcription? Right now the transcription only appears after the end of speech. Is it possible to transcribe in real time, so the model transcribes as the user speaks (and not only once they have finished)? The issue seems to be with this line:

model.finishStream(sctx)

The transcription is only provided after this line; can we not get continuous transcription?

Going through the code, I found a method:

model.intermediateDecode(sctx)

Is this the only API? It seems to keep emitting the previous data even after the speaker has stopped speaking.
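For reference, here is a minimal sketch of how these streaming calls fit together in the 0.4.1 Python bindings (the method names follow the blog post linked above; the model and audio paths are placeholders, and the constructor arguments are just the defaults used by the release client):

import wave
import numpy as np
from deepspeech import Model

# 26 MFCC features, 9 frames of context, beam width 500: defaults from the 0.4.1 client
model = Model('output_graph.pbmm', 26, 9, 'alphabet.txt', 500)
sctx = model.setupStream()

with wave.open('speech.wav', 'rb') as wav:  # assumed 16 kHz, 16-bit mono
    while True:
        chunk = wav.readframes(512)
        if not chunk:
            break
        model.feedAudioContent(sctx, np.frombuffer(chunk, np.int16))
        print('partial:', model.intermediateDecode(sctx))  # partial transcript so far

print('final:', model.finishStream(sctx))  # closes the stream and returns the final transcript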

Yes and no; we still lack the streaming decoder, and there’s a GitHub issue open for it. However, using VAD you can already do something interesting. An example exists in mozillaspeechlibrary: https://github.com/mozilla/androidspeech/blob/master/mozillaspeechlibrary/src/main/java/com/mozilla/speechlibrary/LocalSpeechRecognition.java

Hey @lissyx, thanks a lot. I am already using VAD to detect the start and end of speech, for a good user experience (the user needn’t press a stop button). I shall go through the code and see what it says. Thanks a lot. In the meantime, the method:

model.intermediateDecode(sctx)

works well, and using VAD and this method together, along with some logic to avoid redundancy (the method keeps emitting the same output), I think it could be handled pretty nicely.

Could you just summarise what should be kept in mind while decoding a stream (i.e. what are the building blocks, other than VAD)?

I don’t get your question.

Alright, the question has two question parts and one comment part.

Comment part: I’m using VAD for speech start and end detection in the frontend.

Q1: I’m planning the following; do you think this is a viable idea? Stream audio, get the intermediate output using model.intermediateDecode, then detect end of speech with VAD and call model.finishStream. Use regex and string functions to refine the streaming output of model.intermediateDecode. Is this a viable plan, or is there a glaring caveat I’m missing?
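For the “refine the streaming output” part, one simple illustration (just a sketch, not anything from the library): only forward the part of the partial transcript that is new compared with the previous call, falling back to the full string when the decoder has revised earlier words.

def new_suffix(prev, current):
    """Return what `current` adds on top of `prev`.

    intermediateDecode may revise earlier words between calls, in which
    case the whole current transcript is returned instead of a suffix.
    """
    if current.startswith(prev):
        return current[len(prev):]
    return current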

Q2: For streaming decoding to take place, what are the building blocks that need to be kept in mind, other than VAD (can you direct me to some theory)?

Please refer to the GitHub issue.

I still don’t understand what you want to do.

Ok, thanks, will do.

Let me try it one more time:

  1. Start streaming.
  2. Get the intermediate output using model.intermediateDecode (O1).
  3. Using a function, check whether O1prev == O1current; if they are the same, ignore it, else send the updated O1 to the frontend (I’m using websockets).
  4. Using VAD, detect end of speech (EOS).
  5. After detecting EOS, run model.finishStream (O2).
  6. Send the final output (O2) to the user and on for further NLP.

This code:

import numpy as np

# model, sctx and subproc are set up as in the blog post example above
# (sctx = model.setupStream(), subproc records from the microphone via sox)
output_prev = ""
while True:
    data = subproc.stdout.read(512)
    model.feedAudioContent(sctx, np.frombuffer(data, np.int16))
    output = model.intermediateDecode(sctx)
    if output != output_prev:
        # only forward the transcript when it has changed
        print(output)
        output_prev = output

(Instead of print, send it to the frontend.)
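For steps 4 to 6, here is a rough sketch of how end-of-speech detection could be added to the same loop. It assumes 16 kHz, 16-bit mono audio, uses the third-party webrtcvad package purely as an illustration (that choice is mine, not from this thread), and takes model, sctx and subproc as set up above. Reading 640 bytes at a time makes each chunk a valid 20 ms VAD frame:

import numpy as np
import webrtcvad  # assumption: pip install webrtcvad

vad = webrtcvad.Vad(2)       # aggressiveness 0-3
FRAME_BYTES = 640            # 20 ms of 16 kHz, 16-bit mono audio
MAX_SILENCE_FRAMES = 25      # ~0.5 s of silence ends the utterance

output_prev = ""
silence = 0
while True:
    data = subproc.stdout.read(FRAME_BYTES)
    if len(data) < FRAME_BYTES:
        break
    model.feedAudioContent(sctx, np.frombuffer(data, np.int16))

    # step 3: forward a partial transcript only when it changes
    output = model.intermediateDecode(sctx)
    if output != output_prev:
        print(output)        # or push it over the websocket
        output_prev = output

    # steps 4-5: count consecutive non-speech frames to detect end of speech
    silence = 0 if vad.is_speech(data, 16000) else silence + 1
    if silence >= MAX_SILENCE_FRAMES:
        break

# step 6: close the stream and get the final transcript
print(model.finishStream(sctx))

Note that in 0.4.x intermediateDecode re-runs the decoder over everything accumulated so far, so calling it on every 20 ms frame gets expensive; in practice you would probably only call it every N frames.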

Why wouldn’t it work? But I can’t give any more hints; you need to do your own homework.

Sure… It’s working and thanks… I’ll take it up from here…