Hi guys, I'm creating a language model with KenLM, and I know that KenLM uses an N-gram model. I have 2 questions about this:
- When I build an *.arpa file from a text.txt file, do all the sentences in text.txt need to be 3 to 5 words long to get the best LM? I ask because my text has about 12,000 sentences, and more than 80% of them are about 8-15 words long.
- I'm using this command to build the *.arpa file:
./lmplz --text text.txt --arpa text2.arpa --o 5
Do I need to change the value of the last parameter (currently 5) to some other value like 3 or 4, given my data as described above?
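
For context, here is a rough Python sketch of how I understand n-grams are taken from a sentence, which is why I'm unsure whether sentence length matters. This is only my assumption about the general sliding-window idea, not KenLM's actual code; the `ngrams` helper and the example sentence are made up for illustration, and I'm ignoring the sentence boundary tokens that KenLM adds.

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens from the sentence."""
    # A sentence shorter than n yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical 10-word sentence, not taken from my real text.txt.
sentence = "the quick brown fox jumps over the lazy dog today".split()

# Count how many windows each candidate order would see in this sentence.
for order in (3, 4, 5):
    print(order, len(ngrams(sentence, order)))
```

If this picture is right, a sentence of 8-15 words would simply contribute several overlapping windows for each order, but I'd like confirmation that this is also how lmplz treats longer sentences.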