Model misses some of the words during inference

Hello Team,
@reuben @lissyx @kdavis
Hope you are doing well.
After the suggestions from all of you, I finally reached a WER of 0.05 on my specific set of 20k words/sentences.
Now I am testing inference with only 20 words with different people, and hence included only those 20 words in the language model.
But during inference, in spite of a clear voice, my model sometimes misses some of the words.

chicken crisp half tandoori paneer pizza with cheese
comes as
chicken crisp half paneer pizza with cheese


cheese mysore masala dosa idli vada sambhar
comes as
cheese mysore masala dosa masala.

Despite having included the full name in the language model, why does it miss the middle word or some other word in the sentence? Shouldn't it be able to understand the full sentence with the help of the language model?

It’s not clear how you reproduce that behavior.

So during training, the model sentences like
“cheese mysore masala dosa”, “idli vada sambhar”, “chicken crisp half”,
“tandoori paneer pizza with cheese” were trained separately.
What I did in the language model is combine them, like “chicken crisp half tandoori paneer pizza with cheese” and “cheese mysore masala dosa idli vada sambhar”, and all the other words. Because I want people to speak that way, ordering more than one item, I thought including the exact sentence would help the model better predict what comes next. But it misses some words in the middle, or sometimes takes words like

“dal hariyali” as “dal biryani”, though in the language model I have mentioned
“dal hariyali” and “vegetable biryani” separately.
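As an aside, the two confused words are quite close even at the character level. The sketch below is purely illustrative (it is not what DeepSpeech computes; the decoder works on acoustic probabilities, not spellings), but it shows how little ambiguity is needed to flip between the two candidates:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Only 4 character edits separate the two words the model confuses.
print(levenshtein("hariyali", "biryani"))  # 4
```

If the acoustic model's output for the middle of the word is uncertain, an n-gram LM that contains both “dal hariyali” and “... biryani” cannot always rescue the intended one.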

That’s not answering my question. How exactly do you perform the evaluation where you notice words sometimes missing?

You say:

Because I want people to speak that way, ordering more than one item, I thought including the exact sentence would help the model better predict what comes next.

Do you have people speaking live into a microphone and doing live transcription as a way to evaluate? Is that how you spot missing words?

Yes, I am running it as an API using a file and have made a webapp. So it takes audio from a live mic, converts it into an audio file, and then runs inference on it.
Should I change the beam width of the CTC decoder, or alpha or beta for the LM?

That makes for a lot of variables here that might explain the behavior. This is not really a reliable evaluation, since it depends highly on your current implementation.

You should try to dump some of the failures and closely analyze the sound. Maybe this is related to ?
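To make that analysis concrete, it helps to align each reference transcript against the model's output and log exactly which words were dropped or substituted. A minimal sketch with the standard library (`difflib` does word-level alignment here; it is just a diagnostic aid, not part of DeepSpeech):

```python
import difflib

def word_diff(reference: str, hypothesis: str):
    """Align reference and hypothesis word-by-word and report
    every missing, substituted, or inserted word."""
    ref, hyp = reference.split(), hypothesis.split()
    sm = difflib.SequenceMatcher(None, ref, hyp)
    issues = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "delete":
            issues.append(("missing", " ".join(ref[i1:i2])))
        elif op == "replace":
            issues.append(("substituted",
                           f"{' '.join(ref[i1:i2])} -> {' '.join(hyp[j1:j2])}"))
        elif op == "insert":
            issues.append(("inserted", " ".join(hyp[j1:j2])))
    return issues

# One of the failures reported above:
print(word_diff("chicken crisp half tandoori paneer pizza with cheese",
                "chicken crisp half paneer pizza with cheese"))
# [('missing', 'tandoori')]
```

Running this over every dumped failure, together with the saved audio clips, gives you something reproducible to inspect instead of impressions from live use.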

What do you mean by a lot of variables? I didn’t get it. The sound seems clear; the only thing is that it is being said a little bit fast.
To have a higher influence of the LM, should I increase alpha and beta? And also the beam width?

You have many parameters in your experiment that might have an impact, and so it’s hard to know for sure how reliable it is.

See, that’s “seems”. But then you say “a bit fast”, and that’s something hard to grasp … No recording, nothing reproducible and analyzable so far.

You have also not documented the accent. Given the context, I’m inclined to think it might be Indian English? Which in itself, as we know, is already harder for the model to recognize, since we are (still, too much) biased towards American English.

You need to run your own experiments to see how that improves, but you would require something more reproducible than “I speak and I see”: recorded sentences, composed into a test set, that you later evaluate against different values of the LM parameters.

I have trained the pretrained DeepSpeech 0.4.1 model with 29 people (14 female and 15 male) saying 20k words, so in all I trained the model with 5.8 lacs audio clips, and after that I am getting a good WER, i.e. 0.05, on the test dataset.
After that I uploaded the same output.pb to my webapp. It gives good results 95% of the time, but sometimes it misses a word in a particular sentence.
What I want to understand is: according to the LM, shouldn’t it automatically predict the context as per the n-gram? Why does a word from another sentence come up in one particular sentence even though I have mentioned it in the LM? And why does it sometimes miss a word it should be able to predict according to the n-gram?

That’s quite old :confused:

What does this mean? “5.8 lacs”?

As I said, it can depend on some very specific implementation details … So if you add variability in reproducing your issue, it’s going to be hard to find what is wrong.

Because the LM is just there to help score the probabilities coming out of the acoustic model. So if for some reason the acoustic model outputs something unexpected, then the LM can’t fix everything.
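A toy illustration of that point, in the spirit of the CTC beam-search scoring (`score = log P_acoustic + alpha * log P_lm + beta * word_count`). The probabilities below are invented, and the alpha/beta values are only in the neighborhood of DeepSpeech's defaults, but they show why a word the acoustic model barely heard loses even when the LM knows the full sentence:

```python
import math

def score(log_p_acoustic, log_p_lm, n_words, alpha=0.75, beta=1.85):
    """Toy beam-search hypothesis score: acoustic evidence,
    LM evidence weighted by alpha, word-insertion bonus beta."""
    return log_p_acoustic + alpha * log_p_lm + beta * n_words

# Hypothesis WITH "tandoori": the LM likes it, but the acoustic
# model assigned it almost no probability (invented numbers).
with_word = score(log_p_acoustic=math.log(1e-8),
                  log_p_lm=math.log(0.20), n_words=8)

# Hypothesis WITHOUT "tandoori": weaker LM score, far stronger acoustics.
without_word = score(log_p_acoustic=math.log(0.05),
                     log_p_lm=math.log(0.05), n_words=7)

print(with_word < without_word)  # True: the shorter hypothesis wins
```

So the LM biases the search; it does not override it. If the audio for a word is rushed or unclear, no n-gram weight will reliably recover it.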

Don’t load the LM, so you can see the “raw” acoustic output … Then come up with something more reproducible … Until then, we’re just blind.

You should really run your experiments on something newer than 0.4.1; there have been a lot of improvements since, and huge fixes as well.

@sanjay.pandey So, I just came across an issue documented in

This allows me to confirm that a small LM works very well for that use case, even with an unmatched accent (my French accent against the American English model). Please give it a try.

@sanjay.pandey, did you get a chance to identify the cause of this behavior? We are also facing the same issue on 0.6.1, where we have trained on a cleaned dataset, and during inference on a clip of around 7–10 seconds it misses words from the sentences.
Please share your observations and we can try the same thing.


@sanjay.pandey, were you able to solve the issue and get quality output?