I’m working on a production deployment of DeepSpeech. I’ve been using the following strategy so far:
Now, to improve latency, I’m working on streaming the audio. However, there seems to be a problem: the accuracy with streaming is noticeably lower than that of STT on the complete file.
This is possibly because of the language model. What I mean is: when the full file is transcribed, the language model scores the complete sequence before the result is returned. With streaming, however, the VAD breaks the sequence in the middle, so the language model loses the previous context, and that introduces minor errors into the output.
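To make the issue concrete, here is a minimal sketch of the kind of pipeline I mean (assuming webrtcvad for the VAD and the DeepSpeech 0.9.x Python API; the paths and parameters are placeholders, not my actual production values):

```python
import numpy as np
import webrtcvad
from deepspeech import Model

MODEL_PATH = "deepspeech-0.9.3-models.pbmm"    # placeholder paths
SCORER_PATH = "deepspeech-0.9.3-models.scorer"
SAMPLE_RATE = 16000
FRAME_MS = 30  # webrtcvad accepts 10/20/30 ms frames

model = Model(MODEL_PATH)
model.enableExternalScorer(SCORER_PATH)
vad = webrtcvad.Vad(2)  # aggressiveness 0-3

def transcribe_segments(frames):
    """Feed VAD-detected speech into a stream; finish the stream on silence.

    Each silence boundary closes the current stream, so the scorer
    (language model) starts the next segment with no previous context --
    which is where I suspect the extra errors come from.
    """
    stream = None
    for frame in frames:  # each frame: FRAME_MS of 16-bit mono PCM bytes
        if vad.is_speech(frame, SAMPLE_RATE):
            if stream is None:
                stream = model.createStream()
            stream.feedAudioContent(np.frombuffer(frame, dtype=np.int16))
        elif stream is not None:
            yield stream.finishStream()  # LM context is discarded here
            stream = None
    if stream is not None:
        yield stream.finishStream()
```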
Is this normal? How can I overcome these errors? Should I add single-word sentences to the language model as well?