Different LMs produce drastically different WER

Case 1: (Urdu language)
Dataset used: own (15 hr)
Version: v0.6
Experiment 1: Tested the trained model with an LM built from train.csv and val.csv; Avg. WER = 0.138
Experiment 2: Tested the trained model with an LM built from Wiki text; Avg. WER = 0.128
Comments: Experiment 2 gives better inference, matching my intuition that a better LM produces better inference.

Case 2: (Tamil language)
Dataset used: Common Voice (~12 hr)
Version: v0.7
Experiment 1: Tested the trained model with an LM built from train.csv and val.csv; Avg. WER = 0.17
Experiment 2: Tested the trained model with an LM built from Wiki text; Avg. WER = 0.8
Comments: Experiment 2 gives much worse inference, contradicting case 1.
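For reference, the Avg. WER numbers above are just the per-utterance word error rates of the decoded test set, averaged over utterances. Here is a minimal sketch of that computation in plain Python (a simple Levenshtein distance over words; the function names are mine, not DeepSpeech's, and DeepSpeech's own evaluation script may aggregate slightly differently):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard edit-distance DP table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def average_wer(pairs):
    """Average per-utterance WER over (reference, hypothesis) pairs."""
    return sum(word_error_rate(r, h) for r, h in pairs) / len(pairs)

# Toy example: one perfect utterance, one with a single substitution.
print(average_wer([("the cat sat", "the cat sat"), ("hello world", "hello word")]))  # 0.25
```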

I can't figure out where the error may be.

If the results of case 1 are expected, why does case 2 contradict them?
@lissyx

I don't know? It depends on so many parameters. You share basically no actionable information, and you compare different languages and different versions. There's no way you can seriously expect a meaningful comparison from that, do you?

lissyx is right: if you want help from us, please provide more information. But there have been questions about Tamil before; why not contact those posters and see whether this is a Tamil-specific problem?

Yeah, I understand I haven't shared detailed info; I can share any info required to understand the issue. I will probably retry case 1 with version 0.7, thereby using the same version for both, and compare the results.

But can you clarify one thing for me: in a typical case, which LM would give better inference, an LM built from corpus-only text, or an LM built from corpus-and-wiki text?

Your problem is not well defined, and it depends so much on the data … Generally speaking, adding corpus data into the LM is just plain wrong.

If I remember correctly, I have read in this forum that we build vocabulary.txt from train.csv (obviously test.csv is excluded), and this vocabulary.txt is used to build the LM.

Can you kindly clarify this?

The LM is used at the test step; how can you expect to get a real view of your network's ability to generalize if you feed it data you used during training?

Yeah, I completely understand and agree, but this has been misstated in many places on this forum. Thanks for clarifying.

The LM should be built from a very large number of sentences, perhaps obtained from wiki, news websites, etc.

And when building the LM, we need to make sure we don't include corpus (train, dev, test) transcripts in vocabulary.txt; am I right?
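To make that concrete, here is a minimal sketch of how the external text could be cleaned of corpus transcripts before building the LM. The file names (wiki_sentences.txt, lm_corpus.txt) and the CSV layout (a "transcript" column, as in DeepSpeech-style CSVs) are assumptions for illustration; the actual KenLM build commands are only referenced in comments.

```python
import csv

# Hypothetical paths; DeepSpeech-style CSVs with a "transcript" column are assumed.
CORPUS_CSVS = ["train.csv", "dev.csv", "test.csv"]
EXTERNAL_TEXT = "wiki_sentences.txt"   # large external text dump (wiki, news, ...)
LM_TEXT_OUT = "lm_corpus.txt"          # cleaned text to feed to KenLM

def load_transcripts(csv_paths):
    """Collect normalized transcripts that must NOT end up in the LM text."""
    transcripts = set()
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                transcripts.add(row["transcript"].strip().lower())
    return transcripts

def main():
    banned = load_transcripts(CORPUS_CSVS)
    kept, dropped = 0, 0
    with open(EXTERNAL_TEXT, encoding="utf-8") as src, \
         open(LM_TEXT_OUT, "w", encoding="utf-8") as dst:
        for line in src:
            sentence = line.strip().lower()
            if not sentence or sentence in banned:
                dropped += 1
                continue
            dst.write(sentence + "\n")
            kept += 1
    print(f"kept {kept} sentences, dropped {dropped}")
    # The LM itself would then be built with KenLM, e.g.:
    #   lmplz -o 5 < lm_corpus.txt > lm.arpa
    #   build_binary lm.arpa lm.binary

if __name__ == "__main__":
    main()
```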

If we really need to compare the performance of different LMs for a domain-specific ASR, then we can compare an LM built from wiki text against an LM built from news websites, etc. Right?
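A fair A/B comparison like that would keep everything fixed except the scorer: same acoustic model, same held-out test set, only the LM swapped. A sketch of what that could look like with the v0.7 deepspeech Python package (the model path, scorer names, and CSV columns are hypothetical; average WER would be computed as in the earlier sketch):

```python
import csv
import wave

import numpy as np
from deepspeech import Model

MODEL_PATH = "output_graph.pbmm"                          # hypothetical acoustic model
SCORERS = {"wiki": "wiki.scorer", "news": "news.scorer"}  # hypothetical scorer packages
TEST_CSV = "test.csv"                                     # columns: wav_filename, transcript

def read_wav(path):
    """Return the 16-bit PCM samples as the int16 buffer deepspeech expects."""
    with wave.open(path, "rb") as w:
        return np.frombuffer(w.readframes(w.getnframes()), np.int16)

def decode_test_set(scorer_path):
    ds = Model(MODEL_PATH)
    ds.enableExternalScorer(scorer_path)
    pairs = []  # (reference, hypothesis)
    with open(TEST_CSV, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            pairs.append((row["transcript"], ds.stt(read_wav(row["wav_filename"]))))
    return pairs

for name, scorer in SCORERS.items():
    pairs = decode_test_set(scorer)
    # Feed `pairs` to the average_wer() sketch above to get one number per LM;
    # only the scorer changes between runs, so the comparison is apples to apples.
    print(name, len(pairs), "utterances decoded")
```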

Please link instead of staying vague.

It looks like you understand what you are doing, so it’s unclear why you did it at first.

Again, maybe you have a specific use case where it makes sense, but we are not in your head, so you need to be explicit …

Now I realize that maybe the above is true for this particular use case, command recognition. However, for a general ASR, I agree with you.

Also, do you understand that not only is this a very specific use case, it is also a super-old thread and things might not be true anymore?

Yeah. Thanks for all the help. I knew the info was old, but so is my relationship with DeepSpeech; I have been training since v0.4.