Hi, I am having a small problem understanding the process of creating the language model. I pass a big corpus of text and build a language model that learns the probabilities of the n-grams (I am not sure whether using order 5 means only 5-grams, or everything from unigrams up to 5-grams).
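For reference, this is roughly how I am building it. A minimal sketch only; the file names corpus.txt / lm.arpa / lm.binary are placeholders, and it assumes KenLM's lmplz and build_binary are on the PATH:

import subprocess

# Estimate an order-5 model: the resulting ARPA file holds probabilities for
# every order from unigrams up to 5-grams, together with backoff weights.
with open('corpus.txt') as text, open('lm.arpa', 'w') as arpa:
    subprocess.run(['lmplz', '-o', '5'], stdin=text, stdout=arpa, check=True)

# Convert the ARPA file into KenLM's binary format so it loads faster.
subprocess.run(['build_binary', 'lm.arpa', 'lm.binary'], check=True)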
I think it should work without the trie, but the deepspeech script does not use the language model at all unless both the --lm and --trie files are passed as arguments.
I am not sure if this trie file is related at all to the trie binary mentioned in the KenLM docs (https://kheafield.com/code/kenlm/structures/), since it is built by a DeepSpeech-specific tool (native_client/generate_trie).
I'm also unsure why DeepSpeech created a separate generate_trie tool. That said, you need to supply both files for the deepspeech client:
if args.lm and args.trie:
    print('Loading language model from files {} {}'.format(args.lm, args.trie), file=sys.stderr)
    lm_load_start = timer()
    ds.enableDecoderWithLM(args.alphabet, args.lm, args.trie, LM_ALPHA, LM_BETA)
    lm_load_end = timer() - lm_load_start
    print('Loaded language model in {:.3}s.'.format(lm_load_end), file=sys.stderr)
No, it does not work without the trie. It's a data structure used to keep track of the minimum word probability a word prefix can lead to, and it is used in the scoring process.
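To illustrate the idea, here is a purely conceptual sketch (not DeepSpeech's actual implementation): each trie node remembers the minimum word log-probability reachable from that prefix, so a partial word can be scored before it is complete. The words and log-probabilities below are made up.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.min_log_prob = float('inf')

def insert(root, word, log_prob):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
        # Keep the smallest word log-probability seen along this prefix.
        node.min_log_prob = min(node.min_log_prob, log_prob)

root = TrieNode()
insert(root, 'hello', -3.2)
insert(root, 'help', -4.1)
# The prefix 'hel' now carries min(-3.2, -4.1) = -4.1 as its bound.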
It worked for me on a limited-vocabulary (~12 words) LM without passing the trie (passing the trie I had built caused a segfault), but it probably does not work otherwise. I just removed the 'and args.trie' check; a hack, I know.
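Concretely, the change was just to the check quoted above, something like this (a sketch of my local hack, not a supported configuration):

# Original: if args.lm and args.trie:
if args.lm:  # trie no longer required
    ds.enableDecoderWithLM(args.alphabet, args.lm, args.trie, LM_ALPHA, LM_BETA)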
I am also confused about this now; I do not understand the difference between build_binary and generate_trie. Do you understand it? If so, can you tell me?