Hi, I am having a small problem understanding the process of creating the language model. I pass a big corpus of text and build a language model that learns the probabilities of the n-grams (I am not sure whether using order 5 means only 5-grams, or everything from unigrams up to 5-grams).
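For reference, this is roughly how I am building it. A minimal sketch only; the file names corpus.txt / lm.arpa / lm.binary are placeholders, and it assumes KenLM's lmplz and build_binary are on the PATH:

import subprocess

# Estimate an order-5 model: the resulting ARPA file holds probabilities for
# every order from unigrams up to 5-grams, together with backoff weights.
with open('corpus.txt') as text, open('lm.arpa', 'w') as arpa:
    subprocess.run(['lmplz', '-o', '5'], stdin=text, stdout=arpa, check=True)

# Convert the ARPA file into KenLM's binary format so it loads faster.
subprocess.run(['build_binary', 'lm.arpa', 'lm.binary'], check=True)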
I think it should work without the trie, but the deepspeech script does not use the language model at all unless both the --lm and --trie files are passed as arguments.
I am not sure if this trie file is related at all to the trie binary mentioned in the KenLM docs (https://kheafield.com/code/kenlm/structures/), since it is built by a DeepSpeech-specific tool (native_client/generate_trie).
I'm also unsure why DeepSpeech created a separate generate_trie tool. That said, you need to supply both files for the deepspeech client:
if args.lm and args.trie:
    print('Loading language model from files {} {}'.format(args.lm, args.trie), file=sys.stderr)
    lm_load_start = timer()
    ds.enableDecoderWithLM(args.alphabet, args.lm, args.trie, LM_ALPHA, LM_BETA)
    lm_load_end = timer() - lm_load_start
    print('Loaded language model in {:.3}s.'.format(lm_load_end), file=sys.stderr)
No, it does not work without the trie. It's a data structure used to keep track of the minimum word probability a word prefix can lead to, and it is used in the scoring process.
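To illustrate the idea, here is a purely conceptual sketch (not DeepSpeech's actual implementation): each trie node remembers the minimum word log-probability reachable from that prefix, so a partial word can be scored before it is complete. The words and log-probabilities below are made up.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.min_log_prob = float('inf')

def insert(root, word, log_prob):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
        # Keep the smallest word log-probability seen along this prefix.
        node.min_log_prob = min(node.min_log_prob, log_prob)

root = TrieNode()
insert(root, 'hello', -3.2)
insert(root, 'help', -4.1)
# The prefix 'hel' now carries min(-3.2, -4.1) = -4.1 as its bound.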
It worked for me on a limited-vocabulary (~12 words) LM without passing the trie (passing the trie I had built caused a segfault), but it probably does not work otherwise. I just removed the 'and args.trie' check; a hack, I know.
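Concretely, the change was just to the check quoted above, something like this (a sketch of my local hack, not a supported configuration):

# Original: if args.lm and args.trie:
if args.lm:  # trie no longer required
    ds.enableDecoderWithLM(args.alphabet, args.lm, args.trie, LM_ALPHA, LM_BETA)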
I am also confused about this now; I do not understand the difference between build_binary and generate_trie. Do you understand it? If so, can you tell me?