Setting language model weight to 0 gives different results for different language models

I’m trying to reproduce your results on the LibriSpeech clean test dataset with a pre-trained model. I was playing around with the language model, and I thought that if I set LM_WEIGHT to 0, decoding would take only the acoustic model into account:

ds = Model(models, N_FEATURES, N_CONTEXT, alphabet, BEAM_WIDTH)
ds.enableDecoderWithLM(alphabet, lm, trie, LM_WEIGHT, VALID_WORD_COUNT_WEIGHT)

However, I get different results when I pass different language models with LM_WEIGHT set to zero. To be specific, in one case I pass a trie and an lm.binary that are empty files, and in the other case the files that I downloaded from the project’s GitHub.
Can you please explain this behaviour?

@Kirill can you document your exact STR (steps to reproduce)?

I checked github issues, I guess it’s related to

But you have not replied to my question, and so I cannot help you. It might be related, but unless we know which version you tested …

Sure. I used master and script. I added an evaluate function and modified load_model (the last parameter is lm_weight):

This code does a lot of other things, can you please stick to trivial ones like native_client/python/ ?

Hi @lissyx, I’ve got the same question:

I have just changed the following in native_client/python/

# The alpha hyperparameter of the CTC decoder. Language Model weight
# The beta hyperparameter of the CTC decoder. Word insertion bonus.

and afterwards used different LMs, like:

python native_client/python/ --model training_accurate/export_dir3/output_graph.pb --alphabet deepspeech-0.5.0/model/alphabet.txt --audio $FILE1 --lm lm_trie_vocab/lm.binary --trie lm_trie_vocab/trie --extended
python native_client/python/ --model training_accurate/export_dir3/output_graph.pb --alphabet deepspeech-0.5.0/model/alphabet.txt --audio $FILE1 --lm lm/lm.binary --trie lm/trie --extended
and I got different outputs. How can that happen? Is there maybe another way to not use the LM here? It seems it’s not as easy to exclude the Scorer as it is in because it is using the binary here?! Thanks in advance.

Oh, I can actually answer my own question for everyone else who has this problem: in native_client/python/ it is not necessary to use an LM, and in contrast to there is no default setting! So simply do not use the flags and it won’t take an LM into account, I guess, because: (in

if args.lm and args.trie:
    print('Loading language model from files {} {}'.format(args.lm, args.trie), file=sys.stderr)
    lm_load_start = timer()
    ds.enableDecoderWithLM(args.alphabet, args.lm, args.trie, LM_ALPHA, LM_BETA)
    lm_load_end = timer() - lm_load_start
    print('Loaded language model in {:.3}s.'.format(lm_load_end), file=sys.stderr)

It would have been useful to share those.


Ok true, it is still weird. Maybe the different tries have something to do with it?

Output 1: (lm_trie_vocab)

Loading language model from files lm_trie_vocab/lm.binary lm_trie_vocab/trie
Loaded language model in 0.004s.
Running inference.
2019-08-15 12:28:26.337842: I tensorflow/stream_executor/] successfully opened CUDA library locally
probieren wir es mal mit den aktuellen steg deutschen worten zum einen gibt es die donau darm schiff fahrt gesellschaft kapitaen wir wird welcher de rind fleisch etikett er uns ueber wach uns aufgaben ueber trage uns gesetz platz machen musste was fuer toll woerter
Inference took 4.178s for 14.282s audio file.

Output 2: (lm)

Loading language model from files lm/lm.binary lm/trie
Loaded language model in 0.000161s.
Running inference.
2019-08-15 12:32:49.445661: I tensorflow/stream_executor/] successfully opened CUDA library locally
wir wir es mal mit den aktuell laengsten deutschen wir den dem einen gibt es die deutschen es welche den wir wir welche dem den welche die den fuer machen fuer laengsten machen mit es fuer tolle wir to
Inference took 4.409s for 14.282s audio file.

Both are German language models: one built from lots of text and one containing just the training text files.

@ena.1994 I don’t understand: you run inference with two different sets of LM + trie, so why are you surprised that the final decoding is different?

I am surprised because I thought setting both LM hyperparameters to zero would lead to the LM not being used, even though one is passed in both cases. In both cases the hyperparameters are LM_BETA = 0 and LM_ALPHA = 0, like @Kirill asked about before.

Now your statement is confusing. What is #9 about ? You said you found how to not enable the LM ? And then you compare with two different LMs ? And then you talk about LM weights ?

Sorry for the confusion! I hope I can make it clear: first I wanted to avoid using the LM by setting the weights to zero; afterwards I found out it’s easier to simply not use the flags, and that works fine. So I don’t really have a problem here anymore. BUT the original question was: why is an LM being used even though the weights are both set to zero? That must be the case, because otherwise the outputs would be the same, as they would not depend on the (different) LMs. Maybe it is a bug, OR I am misunderstanding the role of the LM_ALPHA and LM_BETA hyperparameters.
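
To make the expectation concrete, here is a minimal sketch. It assumes the decoder combines scores in the common form "acoustic log-probability + alpha * LM log-probability + beta * word count"; the function name and the numbers are hypothetical and are not taken from the DeepSpeech source.

```python
def combined_score(acoustic_logp, lm_logp, word_count, alpha, beta):
    """Toy beam score: acoustic term plus weighted LM and word-count terms.

    Assumed form for illustration only, not the DeepSpeech implementation.
    """
    return acoustic_logp + alpha * lm_logp + beta * word_count


# With alpha = beta = 0, the LM probability no longer affects the score,
# so two different LMs would be expected to rank hypotheses identically.
score_lm_a = combined_score(-3.2, lm_logp=-7.5, word_count=4, alpha=0.0, beta=0.0)
score_lm_b = combined_score(-3.2, lm_logp=-1.0, word_count=4, alpha=0.0, beta=0.0)
print(score_lm_a == score_lm_b)
```

If the outputs still differ, something other than this weighted sum must depend on the LM files.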

Can you explain thoroughly all your steps for testing LM_ALPHA and LM_BETA ? I might have got a clue while taking my shower …

I see the same thing.
I trained my own non-English model using v.0.5.1. Setting alpha and beta to 0 when running generates different results with different language models on the same test data.

The only way I got consistent results without a language model was to set scorer=None.

Can you share more context / examples of the variations ?

@lissyx, what kind of context? I would like to help.

I am using special Unicode characters in my transcripts and alphabet, which differ from the English alphabet.

I have tested with two different language models, each was tested using:

  1. Default values, lm_alpha=0.75 and lm_beta=1.85: results were acceptable, and each word in the decoded output belongs to the language model used.

  2. lm_alpha=0 and lm_beta=0: the decoded words still belong only to the language model used. Different language models generate different results.

  3. Removing the ‘scorer’ line from the ‘’ code and putting ‘scorer=None’ instead: different language models generate the same results. Decoded words do not necessarily belong to any of the language models used. However, this method takes a lot of time.
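
One reading that would be consistent with observations 2 and 3 is that the trie restricts which words the decoder may emit, independently of the weights. The sketch below is a toy illustration of that hypothesis only; `pick_word`, the word lists and the scores are made up, and it is not DeepSpeech code.

```python
def pick_word(acoustic_scores, vocabulary=None):
    """Return the candidate word with the best acoustic log-probability.

    acoustic_scores: dict mapping word -> acoustic log-probability.
    vocabulary: optional set of allowed words (a stand-in for the trie).
    No LM weight appears anywhere, which mimics lm_alpha = lm_beta = 0.
    """
    candidates = acoustic_scores
    if vocabulary is not None:
        # Hypothetical hard constraint: drop every word the vocabulary lacks.
        candidates = {w: s for w, s in acoustic_scores.items() if w in vocabulary}
    return max(candidates, key=candidates.get)


acoustic = {"laengsten": -1.0, "steg": -1.5, "wir": -2.0}

print(pick_word(acoustic))                   # laengsten (no vocabulary, like scorer=None)
print(pick_word(acoustic, {"steg", "wir"}))  # steg
print(pick_word(acoustic, {"wir"}))          # wir
```

In this toy, the output changes with the vocabulary even though no weight is applied, and removing the vocabulary gives the same result regardless of which word list would have been passed.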

Then I tested with a language model that is unrelated to my work and data: I used the paths of one of the provided English LMs, binary and trie. Setting lm_alpha and lm_beta to 0 generates unacceptable results.

Could you help me find the best way to disable the language model in the final decoding results?

I am using v.0.5.1
I tested the above using besides doing single shot inference from

I don’t see the point here.

Please, share examples.

Could you explain the usecase here ? Are you trying to achieve something ? Or just debugging the different results when LM weights are 0 ?

“garbage unaccepted results”, again, it would be nice that you share examples …

Without anything that we can reproduce on our side, it’s going to be complicated to fix.