Unable to get a good decoding transcript

I have trained DeepSpeech on a Hinglish (Hindi-English mixed) dataset of approximately 2000 hours.

Parameters: LR: 0.0001, dropout: default (results did not improve at 0.08), n_hidden: 2048, lm_alpha: 1.8, lm_beta: 2.5 (I tried tweaking these and got the best results with this lm_alpha and lm_beta).
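
For context, those parameters map onto DeepSpeech's training flags roughly as in the sketch below. This is only an illustration: the flag names are assumed from the 0.5.x-era DeepSpeech.py (check your version), and the CSV paths and checkpoint directory are placeholders.

```python
# Sketch of a training run with the parameters listed above.
# Flag names assumed from 0.5.x-era DeepSpeech; all paths are placeholders.
import subprocess

subprocess.run([
    "python", "DeepSpeech.py",
    "--train_files", "hinglish_train.csv",
    "--dev_files", "hinglish_dev.csv",
    "--test_files", "hinglish_test.csv",
    "--learning_rate", "0.0001",
    "--n_hidden", "2048",
    # dropout_rate left at its default; 0.08 did not improve results
    "--lm_alpha", "1.8",          # decoder LM weight used at test time
    "--lm_beta", "2.5",           # decoder word-insertion weight
    "--checkpoint_dir", "checkpoints/hinglish",
], check=True)
```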

Results: training loss: 48.66, validation loss: 54.87, test loss: 54.95, test WER: 0.24, CER: 13.90.

I am not getting good predicted transcripts (one-shot inference from the checkpoint): a word is predicted correctly in one sentence and misspelled in the following sentences. Sometimes it predicts a word correctly and sometimes it skips it in the same context.
For example:
“pre-filter” becomes “prefer” or “fiction”, “membrane” becomes “mam”, etc.
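
The one-shot inference mentioned above is run from the same checkpoint, roughly like this (again a sketch with assumed 0.5.x-era flag names and placeholder paths):

```python
# Sketch: one-shot inference on a single WAV file from the training checkpoint,
# with the LM binary and trie enabled in the decoder.
import subprocess

subprocess.run([
    "python", "DeepSpeech.py",
    "--checkpoint_dir", "checkpoints/hinglish",
    "--alphabet_config_path", "alphabet.txt",
    "--lm_binary_path", "lm.binary",
    "--lm_trie_path", "trie",
    "--lm_alpha", "1.8",
    "--lm_beta", "2.5",
    "--one_shot_infer", "sample.wav",   # file to transcribe
], check=True)
```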

I have the following questions:

  1. Is the acoustic model working well given the above losses?
  2. Is there a problem with decoding? If so, how can I improve it?
  3. Do I need to look into the language model?

I am getting very bad results from the same trained model. I rebuilt TensorFlow and changed the beam width in client.py to 1024; the results improved, but the model still skips a lot of words and adds new words that were not spoken. Please help.

Any help appreciated.
Thanks

Following. I am working with less data, but decoding gives me nothing.

What do you mean by “nothing in decoding”? Are you getting all your predicted transcripts correct?

“Nothing” means it gives me blank output. WER = 100%.

Hard to tell without proper knowledge of your datasets

What makes you think so?

Obviously, especially since you have not documented how you built yours.

Thanks for the prompt reply.

I am using 2000 hours of Hindi-English mixed conversational speech (mostly Hindi); my labels for the Hindi parts are written in Roman transliteration.

I am not sure about the decoding; I am just asking about it based on the results I mentioned above.

I made my vocabulary from the training speech transcripts and then used KenLM to build the LM binary and trie.
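
Concretely, the build was roughly along these lines. This is only a sketch: it assumes the KenLM binaries (lmplz, build_binary) and DeepSpeech's generate_trie are on PATH, the file names are placeholders, and the generate_trie argument order differs between DeepSpeech releases, so check your build's usage message.

```python
# Sketch of the KenLM LM + trie build described above.
import subprocess

# 1. Train a 5-gram LM on the training transcripts (one sentence per line).
with open("vocab.txt", "rb") as corpus, open("lm.arpa", "wb") as arpa:
    subprocess.run(["lmplz", "--order", "5"], stdin=corpus, stdout=arpa, check=True)

# 2. Convert the ARPA file to KenLM's binary format.
subprocess.run(["build_binary", "lm.arpa", "lm.binary"], check=True)

# 3. Build the trie used by the CTC beam-search decoder.
#    (Argument order differs between DeepSpeech releases.)
subprocess.run(["generate_trie", "alphabet.txt", "lm.binary", "trie"], check=True)
```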

The main problem I am facing is that some of the English words (especially nouns) are predicted correctly in one place and misspelled in other places within the same audio.

Could it just be noise / accents / way of speaking that explains it? 2000 hours is not that much, even though it’s already a good amount.

Also, regarding the nouns: is it likely they are made of sounds that are less frequent in the rest of the dataset?

How many hours of data are sufficient? What minimum training loss / validation loss / WER / CER should I target?

Yes, these words are not general everyday words; they are domain-specific keywords (nouns) and are less frequent.

There are already a lot of messages covering that topic, but ultimately it also depends on your dataset and the training parameters.

Then it may just be that your model has not learnt enough yet and/or is overfitted to your training dataset and thus has a hard time generalizing.

Thanks, I will look into the model.

Correct me if I am wrong: would it help if I trained the LM with these keywords on top of the current language model, or if I added some language information related to these nouns into the LM? Please suggest whether this is possible.
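
If it helps, one common approach (an assumption on my part, not something specific to DeepSpeech) is not to stack a second LM on top, but to add sentences containing the domain keywords to the LM text corpus and rebuild the binary and trie, so the n-gram counts for those nouns stop being negligible. A rough sketch with made-up sentences and placeholder file names:

```python
# Sketch: augment the LM corpus with domain-keyword sentences, then rebuild
# lm.binary and the trie with the same KenLM steps as before.

domain_sentences = [
    "replace the pre-filter before cleaning the membrane",
    "the membrane module needs a new pre-filter",
    # ... more sentences that use the rare domain nouns in context ...
]

with open("vocab.txt", encoding="utf-8") as f:
    corpus = f.read().splitlines()

# Oversample the keyword sentences so their n-grams get non-trivial probability.
corpus.extend(domain_sentences * 50)

with open("vocab_augmented.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(corpus) + "\n")

# Then rerun lmplz -> build_binary -> generate_trie on vocab_augmented.txt.
```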

One last question that I asked earlier:
I am getting very bad results when I predict from the model: some sentences are skipped and some new words are added that I did not put in the vocabulary. To solve this I rebuilt TensorFlow and set BEAM_WIDTH=1024, as suggested in “Getting better prediction accuracy during inshot-inference from checkpoint but less accuracy on trained model?”.
My results have improved, but some transcriptions are still missing and the sentences do not seem grammatically correct.
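
For reference, in the 0.5.x-era Python client the beam width is just a parameter of the Model constructor, so it can be changed without rebuilding anything; below is a sketch under that assumption (the API differs between releases, and all paths are placeholders). A wider beam mostly trades decoding speed for a slightly better search; it will not add words to the LM or fix grammar on its own.

```python
# Sketch of inference through the Python client with a wider beam,
# assuming the 0.5.x-era deepspeech package API; paths are placeholders.
import wave
import numpy as np
from deepspeech import Model

N_FEATURES = 26
N_CONTEXT = 9
BEAM_WIDTH = 1024   # wider beam: slower decoding, usually only marginal gains
LM_ALPHA = 1.8
LM_BETA = 2.5

ds = Model("output_graph.pb", N_FEATURES, N_CONTEXT, "alphabet.txt", BEAM_WIDTH)
ds.enableDecoderWithLM("alphabet.txt", "lm.binary", "trie", LM_ALPHA, LM_BETA)

with wave.open("sample.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio, 16000))   # 16 kHz mono 16-bit audio assumed
```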