KenLM LM vs trie

Hi, I am having some trouble understanding the process of creating the language model. I pass a large text corpus to KenLM and build a language model that learns the probabilities of the n-grams. (I do not know whether order 5 means I use only 5-grams or everything from unigrams to 5-grams.)

# Build pruned LM.
lm_path = '/tmp/lm.arpa'
!lmplz --order 5 \
       --temp_prefix /tmp/ \
       --memory 50% \
       --text {data_lower} \
       --arpa {lm_path} \
       --prune 0 0 0 1

Then I convert the language model to binary format:

!build_binary -a 255 \
              -q 8 \
              trie \
              {lm_path} \
              {binary_path} 

Now I have the model in binary format to make it faster. Then I generate the trie:

./generate_trie ../data/alphabet.txt /tmp/lm.binary /tmp/trie

But what is the trie used for? I do not really understand that. Would DeepSpeech work without it?

Thanks a lot for your help!
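
On the order question: with --order 5, lmplz estimates n-grams of every order from 1 up to 5, not only 5-grams. The ARPA file lists the count for each order in its \data\ header, so one way to check is to read that header. Below is a minimal sketch; the function name and the sample counts are made up for illustration, and it assumes the usual "ngram N=count" header lines.

```python
import re

def arpa_ngram_counts(lines):
    """Return {order: count} read from the \\data\\ header of an ARPA file."""
    counts = {}
    in_header = False
    for line in lines:
        line = line.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if in_header:
            match = re.match(r"ngram (\d+)=(\d+)$", line)
            if match:
                counts[int(match.group(1))] = int(match.group(2))
            elif line:
                break  # first non-empty, non-"ngram" line ends the header
    return counts

# Made-up header for illustration; with a real model, pass open(lm_path) instead.
sample = ["\\data\\", "ngram 1=1000", "ngram 2=5000", "ngram 3=8000", "", "\\1-grams:"]
print(arpa_ngram_counts(sample))
```

For a model built with --order 5, the returned dictionary should have one entry for each order from 1 to 5.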

I think it should work without the trie, but the deepspeech script does not use the language model at all unless both the --lm and --trie files are passed as arguments.

I am not sure whether this trie file is related at all to the trie binary mentioned in the KenLM docs (https://kheafield.com/code/kenlm/structures/), since it is built by a DeepSpeech-specific tool (native_client/generate_trie).

I’m also unsure why DeepSpeech created its own generate_trie tool. That said, you need to supply both files to the DeepSpeech client.

if args.lm and args.trie:
    print('Loading language model from files {} {}'.format(args.lm, args.trie), file=sys.stderr)
    lm_load_start = timer()
    ds.enableDecoderWithLM(args.alphabet, args.lm, args.trie, LM_ALPHA, LM_BETA)
    lm_load_end = timer() - lm_load_start
    print('Loaded language model in {:.3}s.'.format(lm_load_end), file=sys.stderr)

No, it does not work without the trie. It’s a data structure used for keeping track of the minimum word probability a word prefix can lead to, and is used in the scoring process.
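
To make that description a bit more concrete, here is a minimal sketch of a character-level prefix trie in which every node remembers the lowest score of any word reachable through that prefix. It is only an illustration of the idea, not DeepSpeech's actual generate_trie format or code; the class names, the three words, and their log-probability scores are all made up.

```python
class TrieNode:
    def __init__(self):
        self.children = {}              # next character -> TrieNode
        self.min_score = float("inf")   # lowest score of any word under this prefix
        self.is_word = False            # True if the prefix itself is a full word


class PrefixTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, score):
        """Add a word and update the minimum score on every prefix node."""
        node = self.root
        node.min_score = min(node.min_score, score)
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
            node.min_score = min(node.min_score, score)
        node.is_word = True

    def lookup(self, prefix):
        """Return the node for a prefix, or None if no word starts with it."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node


# Hypothetical vocabulary with made-up log10 word probabilities.
trie = PrefixTrie()
for word, score in [("cat", -2.0), ("car", -3.5), ("dog", -1.0)]:
    trie.insert(word, score)

print(trie.lookup("ca").min_score)  # lowest score among "cat" and "car"
print(trie.lookup("x"))             # None: no vocabulary word starts with "x"
```

With a structure like this, a decoder could look up a partial word character by character and see both whether it can still become a vocabulary word and what score bound is stored for that prefix.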

It worked for me with a limited-vocabulary LM (~12 words) without passing the trie (passing the trie I built was causing a segfault), but it probably does not work otherwise. I just removed the ‘and args.trie’ check, which I know is a hack.

That just makes it generate the trie in memory at the beginning of the process.

I am also confused by this question now. I do not understand the difference between build_binary and generate_trie. Do you understand it? If so, can you tell me?

Thanks a lot