Streaming. The best way to transfer state to a new stream

Hello, I’m building a system that does STT in real time. I’m using the Python streaming API. Sessions in my application are around 5 minutes, but I need to give instant feedback. For example, if the user says ‘go to page 40’, I need to show it in the UI.
I use the intermediateDecodeWithMetadata method.
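Roughly, my loop looks like this (a simplified sketch, assuming the DeepSpeech Python bindings; `run_session` and the chunk source are placeholders, not my exact code):

```python
import numpy as np
import deepspeech

def run_session(model, audio_chunks):
    """Feed 16 kHz, 16-bit mono PCM chunks and yield intermediate results."""
    stream = model.createStream()
    for chunk in audio_chunks:
        stream.feedAudioContent(np.frombuffer(chunk, dtype=np.int16))
        best = stream.intermediateDecodeWithMetadata().transcripts[0]
        # Tokens are per-character; join them into the running transcript
        text = ''.join(token.text for token in best.tokens)
        yield text, best.confidence
```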

Let’s say the model recognized the string “Go to page 40 start reading”. In this sentence I can recognize the command “Go to page 40”, but then I want the stream to continue decoding from the phrase “start reading”, not from the beginning of the session.
So far I’ve found two solutions.

First: I finish the current stream, then create a new stream and feed it the audio buffer starting from the time where the phrase “Go to page 40” ended (I can get this from the metadata). This doesn’t work well: sometimes it fails to recognize the phrase “start reading” after I cut the audio.
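In code, this first approach looks roughly like this (a sketch; `restart_after_command`, `session_audio`, and `n_command_tokens` are placeholders for my actual bookkeeping):

```python
def restart_after_command(model, session_audio, metadata, n_command_tokens,
                          sample_rate=16000):
    # `session_audio` is the full int16 session buffer kept on my side;
    # `n_command_tokens` is how many tokens of the best transcript
    # matched the command ("Go to page 40").
    tokens = metadata.transcripts[0].tokens
    if n_command_tokens < len(tokens):
        # start_time (seconds) of the first token after the command
        cut_time = tokens[n_command_tokens].start_time
    else:
        cut_time = len(session_audio) / sample_rate

    new_stream = model.createStream()
    new_stream.feedAudioContent(session_audio[int(cut_time * sample_rate):])
    return new_stream
```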

Second: I never finish the stream, but I remember the timestamp where I finished processing a phrase. This actually works well, but from that point on I can’t use confidence. I use confidence to filter out false-positive words, but it is reported for the whole transcript, so if the sentence is very long and already contains true-positive words, filtering based on confidence no longer works.
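The second approach, roughly (again a sketch; `PhraseTracker` is a made-up name):

```python
class PhraseTracker:
    # One long-lived stream plus an offset of tokens already consumed.
    def __init__(self, model):
        self.stream = model.createStream()
        self.consumed = 0  # tokens already matched to a command

    def feed(self, chunk):
        self.stream.feedAudioContent(chunk)
        best = self.stream.intermediateDecodeWithMetadata().transcripts[0]
        new_text = ''.join(t.text for t in best.tokens[self.consumed:])
        # best.confidence covers the whole stream since it started, which
        # is exactly the problem: it can't be sliced to just the new tokens.
        return new_text, best.confidence

    def mark_consumed(self, metadata):
        self.consumed = len(metadata.transcripts[0].tokens)
```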

Do you have any suggestions on how to solve this problem?

That sounds like the simplest and most robust approach. Have you tried overlapping buffers? Some people have reported occasional trouble at the very start of the audio, and some added a few milliseconds (50-100) of blank silence to help, but we never had the chance to actually characterize a reproducible issue here.

So maybe it could help if you keep a few milliseconds of your previous buffers and feed them into your new stream?
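Something like this, maybe (an untested sketch; `session_audio` and `cut_sample` come from your own bookkeeping, and the durations are guesses to tune):

```python
import numpy as np

SAMPLE_RATE = 16000

def new_stream_with_overlap(model, session_audio, cut_sample,
                            overlap_ms=100, pad_ms=50):
    # Keep ~100 ms of audio from before the cut point and prepend
    # ~50 ms of silence, as suggested above.
    overlap = int(SAMPLE_RATE * overlap_ms / 1000)
    pad = np.zeros(int(SAMPLE_RATE * pad_ms / 1000), dtype=np.int16)
    start = max(0, cut_sample - overlap)

    stream = model.createStream()
    stream.feedAudioContent(np.concatenate([pad, session_audio[start:]]))
    return stream
```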
