Model misses some of the words during inference

sanjay.pandey · December 18, 2019, 12:13pm

Hello Team,
@reuben @lissyx @kdavis
Hope you are doing well.
After the suggestions from all of you, I finally reached a WER of 0.05 for my specific 20k words/sentences.
Now I am testing inference with only 20 words and different speakers, so I included only those 20 words in the language model.
But when running inference through client.py, my model sometimes misses some of the words, even when the voice is clear.

Example
chicken crisp half tandoori paneer pizza with cheese
comes as
chicken crisp half paneer pizza with cheese

and

cheese mysore masala dosa idli vada sambhar
comes as
cheese mysore masala dosa masala.

Despite my having included the full names in the language model, why does it miss a word in the middle, or some other word, of the sentence? Shouldn't it be able to understand the full sentence with the help of the language model?

lissyx · December 18, 2019, 12:23pm

It’s not clear how you reproduce that behavior.

sanjay.pandey · December 18, 2019, 1:23pm

So during training, sentences like
“cheese mysore masala dosa”, “idli vada sambhar”, “chicken crisp half”, and
“tandoori paneer pizza with cheese” were trained separately.
In the language model, I combined them into “chicken crisp half tandoori paneer pizza with cheese” and “cheese mysore masala dosa idli vada sambhar”, along with all the other words. I want people to say more than one item that way, so I thought including the exact sentences would help the model predict what comes next. But it misses some of the words in the middle, or sometimes transcribes words like

“dal hariyali” as “dal biryani”, though in the language model I have listed
“dal haryali” and “vegetable biryani” separately.

lissyx · December 18, 2019, 1:29pm

That’s not answering my question. How exactly do you perform the evaluation where you notice words sometimes missing?

You say:

I want people to say more than one item that way, so I thought including the exact sentences would help the model predict what comes next.

Do you have people speaking live into a microphone and doing live transcription as a way to evaluate? Is that how you spot missing words?

sanjay.pandey · December 18, 2019, 1:43pm

Yes, I am running it as an API using the client.py file and have made a web app. It takes audio from a live mic, converts it to an audio file, and then runs inference on it.
Do I need to change the CTC beam width, or alpha or beta for the LM?

lissyx · December 18, 2019, 1:52pm

That makes a lot of variables here that might explain the behavior. This is not really a reliable evaluation, since it highly depends on your current implementation.

You should try to dump some of the failures and closely analyze the sound. Maybe this is related to issue #2443 in mozilla/DeepSpeech on GitHub, “Missing initial frames causes deepspeech to skip first word, adding some silence about 5ms makes it work most of the time”?

sanjay.pandey · December 19, 2019, 10:02am

What do you mean by a lot of variables? I didn’t get it. The sound seems clear; the only thing is that it is spoken a little bit fast.
To give the LM more influence, should I increase alpha and beta, and also the beam width?

lissyx · December 19, 2019, 10:09am

You have many parameters in your experiment that might have an impact, so it’s hard to know for sure how reliable it is.

See, you say the sound “seems” clear. But then you say it is spoken a “bit fast”, which is hard to grasp … No recording, nothing reproducible and analyzable so far.

You have also not documented the accent. Given the context, I’m inclined to think it might be Indian English? Which in itself, as we know, is already harder for the model to recognize, since we are (still, too much) biased towards American English.

You need to run your own experiment to see how that improves, but you would need something more reproducible than “I speak and I see”: recorded sentences, composed into a test set, that you later evaluate against different values of the LM parameters.
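A recorded test set evaluated against different LM parameter values could be sketched roughly as below. This is only an illustration under assumptions: `transcribe` is a hypothetical callable that you would write yourself to wrap whatever DeepSpeech version you run (it is not a DeepSpeech API), and the names `word_errors`, `wer` and `sweep` are made up for this sketch.

```python
from itertools import product


def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # reference word was dropped
                curr[j - 1] + 1,          # extra word was inserted
                prev[j - 1] + (r != h),   # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def wer(pairs):
    """Corpus WER: total word errors divided by total reference words."""
    pairs = list(pairs)
    errors = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return errors / words


def sweep(test_set, transcribe, alphas, betas):
    """Score every (alpha, beta) pair on the same recorded clips.

    test_set:   list of (wav_path, reference_text) pairs
    transcribe: hypothetical callable (wav_path, alpha, beta) -> text
    """
    results = {}
    for alpha, beta in product(alphas, betas):
        pairs = [(ref, transcribe(path, alpha, beta)) for path, ref in test_set]
        results[(alpha, beta)] = wer(pairs)
    return results
```

Because every parameter pair is scored on the same recordings, a change in WER can be attributed to the LM settings rather than to how a sentence happened to be spoken that time.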

sanjay.pandey · December 19, 2019, 10:35am

I have trained the pretrained DeepSpeech 0.4.1 model with 29 people (14 female and 15 male) saying 20k words, so in all the model was trained on 5.8 lacs audio clips, and after that I am getting a good WER, i.e. 0.05, on the test dataset.
After that I uploaded the same output.pb to my web app. It gives good results 95% of the time, but sometimes it misses a word in a particular sentence.
What I want to understand is: given the LM, shouldn’t it automatically predict the context from the n-grams? Why does a word from another sentence show up in a particular sentence even though I have included that sentence in the LM? And why does it sometimes miss a word that it should be able to predict from the n-grams?

lissyx · December 19, 2019, 10:49am

0.4.1 is quite old.

What does “5.8 lacs” mean?

As I said, it can depend on some very specific implementation details … So if you add variability when reproducing your issue, it’s going to be hard to find what is wrong.

Because the LM is just here to help score the probabilities coming out of the acoustic model. So if for some reason the acoustic model outputs something unexpected, the LM can’t fix everything.
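To make that concrete, here is a toy sketch of how a combined score can behave. It assumes the formulation from the Deep Speech paper, roughly score = log P_acoustic + alpha * log P_lm + beta * word_count; the real decoder applies this during beam search, and every probability below is a made-up number for illustration, not a measurement.

```python
import math


def combined_score(acoustic_logp, lm_logp, n_words, alpha, beta):
    """Acoustic log-probability plus weighted LM score plus word bonus."""
    return acoustic_logp + alpha * lm_logp + beta * n_words


# Made-up numbers: the acoustic model strongly prefers the wrong phrase,
# while the LM prefers the phrase that is actually in its training text.
CANDIDATES = {
    "dal hariyali": {"acoustic_logp": math.log(0.02),
                     "lm_logp": math.log(0.5), "n_words": 2},
    "dal biryani": {"acoustic_logp": math.log(0.60),
                    "lm_logp": math.log(0.05), "n_words": 2},
}


def best(candidates, alpha, beta):
    """Return the candidate with the highest combined score."""
    return max(
        candidates,
        key=lambda text: combined_score(**candidates[text], alpha=alpha, beta=beta),
    )


print(best(CANDIDATES, alpha=1.0, beta=1.0))
print(best(CANDIDATES, alpha=3.0, beta=1.0))
```

With these invented numbers the acoustically favoured phrase wins at alpha=1.0 and the LM-favoured phrase only wins once alpha is raised to 3.0, which is why a surprising acoustic output can survive even when the LM contains the right sentence.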

Don’t load the LM, so you can see the “raw” acoustic output … Then come up with something more reproducible … Until then, we’re just blind.
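Once you have transcripts with and without the LM for a recorded clip, a word-level diff against the reference shows exactly which words were dropped or replaced at each stage. A minimal sketch using only the Python standard library (the helper name `word_diff` is made up here):

```python
from difflib import SequenceMatcher


def word_diff(reference, hypothesis):
    """List the word-level differences between reference and hypothesis.

    Each entry is (operation, reference_words, hypothesis_words), where
    operation is 'delete', 'insert' or 'replace'.
    """
    ref, hyp = reference.split(), hypothesis.split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    return [
        (tag, ref[i1:i2], hyp[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


print(word_diff("chicken crisp half tandoori paneer pizza with cheese",
                "chicken crisp half paneer pizza with cheese"))
print(word_diff("cheese mysore masala dosa idli vada sambhar",
                "cheese mysore masala dosa masala"))
```

If a word is already missing from the no-LM output, the problem is on the acoustic or audio side; if it only disappears once the LM is loaded, the LM or its alpha and beta values are the place to look.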

lissyx · December 19, 2019, 10:50am

You should really do your experiments on something newer than 0.4.1; there have been a lot of improvements since, and huge fixes as well.

lissyx · December 20, 2019, 12:46pm

@sanjay.pandey So, I just came across an issue documented in https://github.com/mozilla/DeepSpeech/issues/2612

This allows me to confirm that a small LM works very well for that use case, even with an unmatched accent (my French accent against the American English model). Please give it a try.

Kunal.botify · March 1, 2020, 6:21am

@sanjay.pandey, did you get a chance to identify the cause of this behavior? We are also facing the same issue on 0.61: we trained on a cleaned dataset, and during inference on a clip of around 7-10 seconds, it misses words from the sentences.
Please share your observations so we can try the same thing.

Kunal.botify · May 14, 2020, 6:02am

@sanjay.pandey, were you able to solve the issue and get quality output?

reza · November 4, 2020, 8:03am

Were you able to solve the issue and get quality output?

reza · November 4, 2020, 8:05am

Description:
I downloaded the sample WAV files from the release folder of the deepspeech client and stripped some audio, so that it is still recognizable to a human listener, but when it is fed to the ds client, recognition does not work for the first word,
e.g. should an hold on the way
If I add extra silence at the front of this trimmed audio, about 800 samples (5ms),
then recognition works, or gets close, for the first word,
e.g. after adding silence:
what should one hold on the way
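A minimal sketch of that workaround with Python's standard `wave` module is below. It assumes 16-bit PCM WAV input, where zero-valued samples are silence; the helper name `prepend_silence` and its default are illustrative only. Note that at a 16 kHz sample rate, which is an assumption here, 800 samples would be 50 ms rather than 5 ms, so the amount is left as a parameter.

```python
import wave


def prepend_silence(src_path, dst_path, n_samples=800):
    """Copy a PCM WAV file, adding n_samples of silence at the start."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())

    # Assumption: signed PCM of 16 bits or more, where silence is all zeros.
    # 8-bit WAV is unsigned, so zero bytes would not be silence there.
    if params.sampwidth < 2:
        raise ValueError("expected PCM audio with at least 16 bits per sample")

    silence = b"\x00" * (n_samples * params.sampwidth * params.nchannels)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)               # same rate, width and channels
        dst.writeframes(silence + frames)   # frame count is fixed up on close
```

Running the padded and unpadded versions of the same clip through the client would show whether the missing first word really comes from too little leading audio.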
