How can I add a custom vocab.txt and build a language model (lm.binary, trie) for the pretrained model v0.2.0?

palash.shinde · April 10, 2019, 9:14am

I am following the how-i-trained-a-specific-french-model tutorial. Using /bin/bin/./lmplz --text vocabulary.txt --arpa words.arpa --o 3 and /bin/bin/./build_binary -T -s words.arpa lm.binary, I was able to generate words.arpa and lm.binary. My vocabulary.txt contains all of its text in UPPER CASE.

To generate these files (.arpa, .binary), I was advised to follow the README.md at https://github.com/mozilla/DeepSpeech/tree/master/data/lm

Questions:

Given that I can generate the .arpa and .binary files from UPPER CASE text (my vocabulary.txt) using the lmplz and build_binary commands from the French tutorial, I have some questions about data/lm/README.md:

a) Why does it convert the whole text to lowercase? (I was able to generate the .binary file from the UPPER CASE text in vocabulary.txt.)

b) Is it compulsory to convert the UPPER CASE text to lowercase? Will there be any repercussions later if the UPPER CASE text in vocabulary.txt is not converted?

c) Could you also explain --memory and --prune, as used below?

!lmplz --order 5 \
       --temp_prefix /tmp/ \
       --memory 50% \
       --text {data_lower} \
       --arpa {lm_path} \
       --prune 0 0 0 1

d) Could you also explain -a 255 and -q 8, as used below?

!build_binary -a 255 \
              -q 8 \
              trie \
              {lm_path} \
              {binary_path}

I checked the official KenLM pages on the kheafield site but couldn’t work these options out.
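
To make the questions concrete, here is a minimal Python sketch of the steps as I understand them. It is not the README's actual code: the function names (`lowercase_corpus`, `lmplz_args`, `build_binary_args`) and the file names in the usage note are my own placeholders. The flags are copied from the commands quoted above, and the sketch only assembles the command lines without running them.

```python
from pathlib import Path
from typing import List


def lowercase_corpus(src: Path, dst: Path) -> None:
    """Write a lowercased copy of src to dst, one line at a time."""
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line.lower())


def lmplz_args(text: Path, arpa: Path) -> List[str]:
    """Argument list for the lmplz call from question (c); nothing is executed."""
    return [
        "lmplz",
        "--order", "5",
        "--temp_prefix", "/tmp/",
        "--memory", "50%",
        "--text", str(text),
        "--arpa", str(arpa),
        "--prune", "0", "0", "0", "1",
    ]


def build_binary_args(arpa: Path, binary: Path) -> List[str]:
    """Argument list for the build_binary call from question (d); nothing is executed."""
    return ["build_binary", "-a", "255", "-q", "8", "trie", str(arpa), str(binary)]
```

With my files this would be `lowercase_corpus(Path("vocabulary.txt"), Path("vocabulary_lower.txt"))`, after which each list could be passed to `subprocess.run(..., check=True)`, assuming the lmplz and build_binary binaries are on the PATH.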