Poorer results with streaming vs. STT on a complete audio file

Dear @lissyx @reuben team and the community,

I’m working on putting DeepSpeech into production. I’ve been using the following strategy until now:

Record the user’s voice with JavaScript in the front end and send the WAV file to the server for transcription.

Now, to improve latency, I’m working on streaming the audio. However, there seems to be a problem: the accuracy with streaming is noticeably lower than that of STT on the complete file.

This is possibly because of the language model. What I mean is: when the full file is transcribed, the language model scores the complete utterance before the result is returned. With streaming, however, the VAD breaks the sequence in the middle, so the language model loses the previous context, and that introduces minor errors in the output.
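To make the difference concrete, here is a minimal sketch (assuming the DeepSpeech 0.9.x Python bindings and placeholder model/scorer paths) contrasting batch decoding of the whole file with decoding each VAD segment in its own stream. In the second case the scorer only ever sees one segment’s worth of context:

```python
import numpy as np
import deepspeech

# Placeholder paths -- substitute your own acoustic model and scorer.
model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

def transcribe_full(audio: np.ndarray) -> str:
    """Batch mode: the scorer sees the whole utterance at once."""
    return model.stt(audio)

def transcribe_per_segment(segments) -> str:
    """Streaming with VAD cuts: each segment is decoded in its own
    stream, so the scorer has no context across segment boundaries."""
    parts = []
    for segment in segments:          # each segment: 16 kHz int16 samples
        stream = model.createStream()
        stream.feedAudioContent(segment)
        parts.append(stream.finishStream())
    return " ".join(parts)
```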

Is this normal? How can I overcome it? Should I add single-word sentences to the language model as well?

If you think it’s the VAD, just don’t use VAD; streaming mode should be as accurate as “wav file” mode, but much faster.

If I don’t use VAD and still stream, won’t it be a big issue if the streamed audio is long? And is adding single words to the LM for inference a good option?
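For reference, a rough sketch of what “streaming without VAD” could look like with the 0.9.x Python API: keep a single stream open, feed chunks as they arrive, optionally pull partial results with intermediateDecode(), and only call finishStream() when the user stops speaking. (The paths and chunk handling here are assumptions, not the library’s prescribed setup.)

```python
import deepspeech

# Assumed model/scorer paths.
model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

def transcribe_stream(chunks):
    """One continuous stream, no VAD cuts: the scorer keeps the full
    sentence context until finishStream() is called."""
    stream = model.createStream()
    for chunk in chunks:                       # numpy int16 arrays from the client
        stream.feedAudioContent(chunk)
        partial = stream.intermediateDecode()  # optional partial text for the UI
    return stream.finishStream()
```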

E.g., if the sentence is “Note in history transformer has a problem of …”

The first part, “Note in history”, is fine. But the second part, “transformer has”, is not getting recognised when I’m streaming, while the complete sentence is recognised correctly when I run STT on the full WAV file.

What is the best way to solve this? (And what could be the reason? Is it because such sentences are not in the LM?)


Hi @sayantangangs.91,

I have a similar observation. I used to test my fully customised model with batch processing, and it was predicting very well.
Now that I’m applying streaming, I’m getting weird transcripts. Very often parts of my speech are cut off, usually at the start or the end of the spoken sequence.
Can anyone help with that?

My understanding is that while streaming we chunk the audio and transcribe these chunks separately. The main problem is probably with the language model: it is not applied to the whole sentence but only to the individual chunks, am I right? That is probably why the transcript looks so weird.
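One way I could check this (a rough sketch, assuming the 0.9.x Python bindings, a 16 kHz 16-bit mono WAV, and made-up file names) is to decode the same recording once as a whole and once in fixed-size chunks, and compare the transcripts. If the per-chunk version reproduces the “weird” output, the issue is the per-chunk scoring rather than the acoustic model:

```python
import deepspeech
from scipy.io import wavfile

model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")      # assumed path
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")  # assumed path

rate, audio = wavfile.read("test_utterance.wav")              # hypothetical test file

# Whole-file decode: the scorer sees the complete sentence.
print("full  :", model.stt(audio))

# Per-chunk decode: each 2-second chunk is scored in isolation,
# which mimics an utterance being cut mid-sentence.
chunk_len = 2 * rate
pieces = [model.stt(audio[i:i + chunk_len]) for i in range(0, len(audio), chunk_len)]
print("chunks:", " ".join(pieces))
```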
Has anyone overcome this somehow? Maybe with some post-processing, or any other ideas? Thanks in advance!