Error in streaming

I have been using DeepSpeech in a setup where the full sound blob was recorded and then sent to the backend for inference. However, this is not instantaneous, so I planned to use real-time streaming instead. It seems the accuracy drops with streaming, though.
Could you please explain why this happens and how it can be improved?

That’s not expected; please give more context.

One particular thing you should make sure to get right with streaming is not to drop any samples during the transcription process. Streaming happens in batches of 16 time steps (320 ms), so in those bursty periods you can sometimes drop recorded samples, especially if you’re recording and transcribing in the same thread. Use threads and a queue.
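A minimal sketch of that producer/consumer split. The `feed`/`finish` callables are stand-ins for where DeepSpeech’s streaming calls (feeding audio into the stream, then finalizing it) would go; the demo below just collects chunks so it can run standalone:

```python
# Sketch: decouple recording from transcription with a thread and a queue,
# so bursty inference (batches of 16 time steps / 320 ms) never blocks the
# audio capture and drops samples.
import queue
import threading

audio_q = queue.Queue()  # recorded chunks go here; None signals end-of-stream

def recorder(chunks):
    # In production this would be the microphone callback (e.g. PyAudio);
    # here we simulate it with pre-made chunks.
    for chunk in chunks:
        audio_q.put(chunk)
    audio_q.put(None)

def transcriber(feed, finish):
    # feed/finish stand in for the DeepSpeech streaming API calls.
    while True:
        chunk = audio_q.get()
        if chunk is None:
            break
        feed(chunk)
    return finish()

# Minimal demo with stand-in functions instead of a real model:
received = []
t = threading.Thread(target=recorder, args=([b"a", b"b", b"c"],))
t.start()
result = transcriber(received.append, lambda: b"".join(received))
t.join()
print(result)  # b'abc'
```

The key point is that the recording side only ever does a cheap `put()`, so it never stalls while the model is busy.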

Well, for eg:

“note down in diagnosis acute myeloid leukaemia”

works great if the whole file is given

With streaming, it comes out like:
“note down in diagnosis”
“take note in”

Instead of “acute myeloid leukaemia”? What’s your accent? It’s very likely a case of domain-specific words not being well learnt, perhaps combined with a non-American English accent, and what @reuben mentioned can also play a role here.

I get it. However, for now I’m working with the “mic_vad_streaming” example rather than a production setup. I understand using threads would help in production so that recorded samples aren’t lost, although I’m not sure Django (I’m using a Django-powered API) allows threading. Nonetheless, that’s beyond the scope here. Still, I shall try using two threads, one for recording and the other for transcribing, and if I hit any issues there I’ll follow up in this thread.

This is the design we used as well in several tools built on the Streaming API, and it works quite well. With Python you might want to use multiprocessing, though; threads are not always the best solution in this environment.
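For reference, the same pattern with `multiprocessing` instead of threads, so the GIL can’t let heavy inference starve the capture side. This is a generic sketch, not DeepSpeech-specific code; the comment marks where the model call would go:

```python
# Sketch of the multiprocessing variant: the recorder runs in its own
# process and hands chunks over via a multiprocessing.Queue.
import multiprocessing as mp

def recorder(q, chunks):
    # Producer process: in real code this wraps the microphone capture.
    for chunk in chunks:
        q.put(chunk)
    q.put(None)  # sentinel: recording finished

def consume(q):
    # Consumer: in real code each chunk would be fed to the
    # DeepSpeech stream here instead of being collected.
    parts = []
    while True:
        chunk = q.get()
        if chunk is None:
            break
        parts.append(chunk)
    return b"".join(parts)

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=recorder, args=(q, [b"x", b"y"]))
    p.start()
    result = consume(q)
    p.join()
    print(result)  # b'xy'
```

One caveat: a `multiprocessing.Queue` pickles chunks across the process boundary, which adds a small copy cost per chunk, but for 320 ms audio buffers that is negligible compared to inference time.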



My accent is Indian English, and I’ve trained on about 80 hours of data, fine-tuning the existing 0.5.1 model. The loss settles around 12.3, with WER around 8%. Of these 80 hours, 65 hours are general Indian English and 15 hours are domain-specific speech in an Indian English accent.

@lissyx, could you please clarify: if the voice model is trained decently and I then add new words to the language model, will that cause a big drop in accuracy? For example, if “alzheimer’s disease” was not in the original LM but is added in the updated LM, isn’t it expected to work well too?

Yes, that’s a legit point.

Thanks a lot. I shall keep that in mind and update accordingly.
Looking forward to your reply on the previous point (about the LM).

That might be enough, or not, I’m unsure. Be careful about noise, etc., as well.

At least you have some data; whether it’s enough, I don’t know. That depends on how you augmented / defined your test set, i.e. whether the WER figures you have are accurate.

That might be a big help. Unfortunately we don’t yet have an API to add a domain-specific LM on top of the generic LM.

Thank you. I understand it won’t be possible to give a definite answer on whether the dataset is enough.

I think you got me wrong. I mean I’m creating my own domain-specific LM, with sentences like: “Note in diagnosis query acute myeloid leukaemia”, and those same sentences are also covered by the voice model.
Now I’m recreating another LM (I have the existing .txt file, where I’m adding new lines and creating the LM from there) and adding some new lines, one of them being:
“record in diagnosis alzheimer’s disease”
Now, the model has never heard anyone pronounce “alzheimer’s disease”. Will that be a problem? I believe the model is expected to work with unheard words as well. Right?
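For context, this is roughly what the “adding lines to the .txt file and creating the LM from there” workflow looks like with KenLM for DeepSpeech 0.5.x. All file names are placeholders, and the exact flags may differ in your setup:

```shell
# Hypothetical sketch of the 0.5.x custom LM build (paths are placeholders).
# 1. Train an ARPA model over your sentence corpus with KenLM:
lmplz --order 5 --text vocabulary.txt --arpa lm.arpa
# 2. Convert it to the binary format DeepSpeech loads:
build_binary -a 255 -q 8 trie lm.arpa lm.binary
# 3. Generate the trie used by the decoder (tool ships with DeepSpeech):
generate_trie alphabet.txt lm.binary trie
```

Because the acoustic model predicts characters, a word it has never heard can still be decoded, as long as the LM scores it high enough for the decoder to prefer it.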

I just tested my model with the original LM (from Deepspeech) and following is the output:

Original sentence: “Can you hear me”

In case the whole sentence is transcribed at once, DeepSpeech output: “Can you hear me”
In case the sentence is transcribed in parts via streaming, DeepSpeech output:
part 1: “Can you”
part 2: “here me”

I believe the LM is at play in selecting between “here” and “hear”. In streaming it’s not getting the context. Or am I missing something?

Streaming should pick up the context, as far as I remember. Maybe @reuben can correct me on that?

Also, what version are you running?

Up to the point that the model has been trained enough to be able to generalize correctly.

So you only have your own sentences in the LM, right? You don’t re-use our LM and add your own data?

That would be really helpful. If streaming does pick up the context, I might have been digging in the wrong hole… @reuben, if you can kindly enlighten us.

Got it.

I’m using my custom LM, with only my sentences in the LM and not re-using your LM.

Although I feel I’ll have to reuse your LM for generalizing some things (so there, I’ll use your .txt file, add lines from my .txt file, and generate a new LM, or use interpolation between LM_deepspeech and LM_custom_generated_by_me).
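The “use your .txt file and add lines from my .txt file” step can be sketched as a simple corpus merge before rebuilding the LM. A minimal sketch, with hypothetical file names; it deduplicates and lowercases lines, which matches how DeepSpeech LM corpora are usually normalized:

```python
# Sketch: build one vocabulary file from the generic DeepSpeech corpus plus
# the domain-specific sentences, deduplicated, before the usual LM build.
def merge_corpora(paths, out_path):
    seen = set()
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip().lower()
                    if line and line not in seen:
                        seen.add(line)
                        out.write(line + "\n")

# Example usage (hypothetical file names):
# merge_corpora(["deepspeech_vocab.txt", "my_domain.txt"], "merged_vocab.txt")
```

A plain merge gives the domain sentences the same weight as everything else; proper interpolation of two LMs (weighting the domain model higher) needs LM-toolkit support rather than text-level merging.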

Hi @reuben, looking forward to your feedback on this.

I’m pretty sure streaming mode should give you the same result as legacy mode. If it breaks the long sentence into pieces, maybe you have VAD somewhere.