Hello, I’m building a system that does STT in real time using the Python streaming API. Sessions in my application are around 5 minutes long, but I need to give instant feedback: for example, if the user says “go to page 40”, I need to show it in the UI immediately.
I use the intermediateDecodeWithMetadata method.
Let’s say the model recognized the string “Go to page 40 start reading”. In this text I can detect the command “Go to page 40”, but afterwards I want the stream to decode starting from the phrase “start reading”, not from the beginning of the session.
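To make the command detection concrete, this is roughly what I do (names here are illustrative, not the real API: I assemble word objects from the per-character tokens and their start times that the metadata gives me):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Word:
    """A word assembled upstream from the per-character tokens in the
    decode metadata (each token carries a start time)."""
    text: str
    start_time: float  # seconds from stream start
    end_time: float

def find_command_end(words: List[Word], command: str) -> Optional[float]:
    """Scan the decoded words for `command` and return the end time of
    its last word, or None if the command has not been spoken yet."""
    target = command.lower().split()
    texts = [w.text.lower() for w in words]
    for i in range(len(texts) - len(target) + 1):
        if texts[i:i + len(target)] == target:
            return words[i + len(target) - 1].end_time
    return None
```

So with words for “Go to page 40 start reading”, `find_command_end(words, "go to page 40")` gives me the time where the command ended, and everything after that should belong to the next phrase.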
I’ve found two solutions so far.
First: I finish the current stream, then create a new stream and pass it the audio buffer starting from the time where the phrase “Go to page 40” ended (I can get this time from the metadata). This doesn’t work well: sometimes it fails to recognize the phrase “start reading” after I cut the audio.
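My guess is that the new stream loses acoustic context at the cut point, so I’ve been experimenting with backing the cut off by a fraction of a second. A sketch of how I compute the sample offset (the function name and the 0.3 s overlap are my own choices, not anything from the library):

```python
def slice_offset(end_time_s: float, sample_rate: int = 16000,
                 overlap_s: float = 0.3) -> int:
    """Convert the command's end time into a sample index for the new
    stream, backing off by `overlap_s` seconds so the fresh stream gets
    some acoustic context before the next phrase."""
    offset = int((end_time_s - overlap_s) * sample_rate)
    return max(offset, 0)  # clamp for commands near the session start
```

Then I feed `audio_buffer[slice_offset(command_end):]` into the new stream. The overlap helps a bit, but the misrecognition still happens often enough to be a problem.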
Second: I never finish the stream, but I remember the timestamps where I finished processing a phrase. This actually works well, but from those points on I can’t use confidence. I use confidence to filter out false-positive words, but it is reported for the transcript as a whole, so once the sentence is very long and contains true-positive words, filtering based on confidence stops working.
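One idea I’ve been toying with, assuming the confidence value accumulates over the whole transcript (which I haven’t verified): snapshot it at every checkpoint and score only the delta, normalized by the number of new tokens. A rough sketch (all names are mine):

```python
class SegmentConfidence:
    """Score only the words decoded since the last checkpoint, under the
    (unverified) assumption that the stream's confidence value is a
    cumulative score over the whole transcript."""

    def __init__(self) -> None:
        self._conf_at_checkpoint = 0.0
        self._tokens_at_checkpoint = 0

    def segment_score(self, total_conf: float, total_tokens: int) -> float:
        """Average confidence contributed per token decoded since the
        last checkpoint (0.0 if nothing new has arrived)."""
        new_tokens = total_tokens - self._tokens_at_checkpoint
        if new_tokens <= 0:
            return 0.0
        return (total_conf - self._conf_at_checkpoint) / new_tokens

    def checkpoint(self, total_conf: float, total_tokens: int) -> None:
        """Mark the current phrase as consumed."""
        self._conf_at_checkpoint = total_conf
        self._tokens_at_checkpoint = total_tokens
```

I’d call `checkpoint()` right after consuming “Go to page 40”, then filter “start reading” using `segment_score()` instead of the raw transcript confidence. I’m not sure the confidence value actually behaves additively like this, though.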
Do you have any suggestions on how to solve this problem?