Hi, I was running the Python package for this project and I tried to create a new language model and a trie.
For the language model, I used kenlm’s
/bin/lmplz -o 2 < text > text.arpa
and then
/bin/build_binary text.arpa text.binary
Note that the text I fed to the language model consists of 576 sentences, each containing a single word. I then generated the trie. To do so, I took the vocabulary used by the language model (576 words) and wrote it to a text file named 576words.txt, with one word per line. I then generated the trie with
/util/generate_trie alphabet.txt text.binary 576words.txt trie
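For reference, here is a minimal sketch of how the vocabulary file could be produced from the training text. The function names unique_words and write_vocab are just for illustration (they are not part of kenlm or this project), and I am assuming the corpus is UTF-8 with whitespace-separated words.

```python
def unique_words(lines):
    """Return the distinct whitespace-separated words in `lines`, in first-seen order."""
    seen = {}
    for line in lines:
        for word in line.split():
            # dict keys keep insertion order, so duplicates are dropped
            # without reordering the vocabulary
            seen.setdefault(word, None)
    return list(seen)


def write_vocab(corpus_path, vocab_path):
    """Write one vocabulary word per line and return the number of words written."""
    with open(corpus_path, encoding="utf-8") as corpus:
        words = unique_words(corpus)
    with open(vocab_path, "w", encoding="utf-8") as out:
        out.write("\n".join(words) + "\n")
    return len(words)


# Example call with the file names used above:
# write_vocab("text", "576words.txt")
```

The returned count makes it easy to confirm that the file really contains the 576 words the language model was trained on.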
I then ran the Python package, and here are my two results.
Original:
be lon f malbax flowr daughter blueprin trackter scoubol a glue canna mustard salt pepper head baned shoveel palace leanax basement it bl
New:
belonfmalbax flower daughter blueprint trackter scoubol aglue annal mustard salt pepper head baned shovel palace leanax basement it bl
However, I want the output to be:
balloon mailbox flower daughter blueprint tractor scalpel igloo canal mustard salt pepper headband shovel palace Kleenex basement igloo
It looks like it would decode this way if the language model were working. Please advise whether I am building the language model correctly, or whether there is another way to improve this result. Is there any way to force the output to stay within the language model's vocabulary?
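To make the question concrete, the only workaround I can think of is a post-processing step outside the decoder: snap each decoded word to its closest vocabulary entry. Below is a rough sketch using Python's standard difflib module. The helper name snap_to_vocab and the 0.6 cutoff are my own assumptions, not anything this project provides. It would not repair word-boundary errors such as "be lon f malbax", which is why I would prefer a decoder-level option.

```python
import difflib


def snap_to_vocab(transcript, vocab, cutoff=0.6):
    """Replace each word in `transcript` with its closest match in `vocab`.

    Words with no match scoring at least `cutoff` are left unchanged.
    """
    snapped = []
    for word in transcript.split():
        # get_close_matches returns at most n candidates, best match first
        matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
        snapped.append(matches[0] if matches else word)
    return " ".join(snapped)


# Small illustrative vocabulary; the real one would be read from 576words.txt
vocab = ["flower", "shovel", "blueprint", "tractor"]
print(snap_to_vocab("flowr shoveel blueprin", vocab))
```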
Thanks so much for creating such a great project!