Hi, I was running the Python package for this project and I tried to create a new language model and a trie.
For the language model, I used kenlm’s /bin/lmplz -o 2 < text > text.arpa and then /bin/build_binary text.arpa text.binary. Note that the text that I put into the language model consists of 576 sentences, each with one word. Then, I generated the trie. In order to do so, I found the vocabulary that I was using within the language model (it is 576 words), and I created a text file of those words named 576words.txt, with one word on each line. I then generated the trie with /util/generate_trie alphabet.txt text.binary 576words.txt trie.
I then ran the Python package, and here are my two results:
Original:
be lon f malbax flowr daughter blueprin trackter scoubol a glue canna mustard salt pepper head baned shoveel palace leanax basement it bl
New:
belonfmalbax flower daughter blueprint trackter scoubol aglue annal mustard salt pepper head baned shovel palace leanax basement it bl
And it looks like it would decode it this way if the language model were working. Please advise whether I am building the language model correctly, or if there is any other way to improve this result. Is there any way to force the output to stay within the language model?
Well, it seems correct!
But it seems that your model has a bad WER (word error rate)!
For your binary, you must send kenlm a text file containing your sentences, not words one by one
(complete sentences, so it learns word-placement probabilities).
Your vocab is wrong (it's not like CMUSphinx!).
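The advice above can be sketched in a few lines. This is only an illustration of the suggested fix; the file names and the toy sentences are my own assumptions, not values from this thread:

```python
# Sketch of the suggested fix. File names and the toy sentences are
# illustrative assumptions, not values from this thread.
sentences = [
    "salt pepper head",
    "flower daughter basement",
]

# Corpus for kenlm's lmplz: one COMPLETE sentence per line,
# so the LM can learn word-placement probabilities.
with open("sentences.txt", "w") as f:
    f.write("\n".join(sentences) + "\n")

# Vocabulary for generate_trie: one word per line, derived from
# the same corpus so the trie and the LM agree.
vocab = sorted({word for line in sentences for word in line.split()})
with open("vocab.txt", "w") as f:
    f.write("\n".join(vocab) + "\n")
```

The point is that the sentence file and the word list come from the same text, so every trie word is one the LM has actually seen in context.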
Thanks for the help! For the vocab (when I generate the trie), what should I be using? Should I use 576words.txt, or something else?
Their output_graph.pb is their model, trained on millions of sentences, so it covers lots of possibilities.
I think you made an error in your vocab:
(in CMUSphinx, your vocab contained one word per line, with phonemes, for the next LM step)
I propose you modify your vocab so it contains one complete sentence (the complete wav content) per line,
and use it to create your LM file (as you did, correctly!).
That way, you'll have an lm.binary optimized for your sentences (with good word positions within each phrase).
How could your LM predict a good word position in a sentence if it hasn't learned it before?!
Test it, and tell us if it works.
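One quick way to test the point about the vocab matching the LM corpus: list any vocabulary entries that never occur in the sentence corpus. This helper is a sketch of my own (the function name and example data are assumptions, not from the thread):

```python
def words_missing_from_corpus(corpus_lines, vocab_words):
    """Return vocab entries that never occur in the corpus lines.

    If this list is non-empty, a trie built from the vocab would
    reference words the LM was never trained on.
    """
    seen = {word for line in corpus_lines for word in line.split()}
    return [word for word in vocab_words if word not in seen]

# Example: "shovel" is in the vocab but has no support in the corpus.
corpus = ["salt pepper head", "flower daughter basement"]
print(words_missing_from_corpus(corpus, ["salt", "shovel"]))
```

An empty result means the trie and the LM are built from the same word set, which is what the advice above is driving at.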
For my robot, I needed a small corpus, for voice orders:
I created my own wavs and text,
used my original text file as the vocab (as is), and for the LM creation too,
so both the vocab and the LM come from the same file (one complete sentence per line).
And if you want to create your own .pb model later, tell me.
I managed to solve the problem by doing two things: creating a new language model which included every possible pair of words, and then tuning the hyperparameters, specifically the “insertion of words in language model” and the “importance of language model”. Thanks so much for your help!
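For readers wondering what those two hyperparameters do: in typical CTC beam-search decoders they correspond to an LM weight (how much the language model score counts against the acoustic score) and a word-insertion bonus. Here is a toy sketch of the usual scoring rule; the names and default values are illustrative, not DeepSpeech's actual flags or API:

```python
def combined_score(acoustic_logprob, lm_logprob, word_count,
                   lm_weight=1.75, word_insert_weight=1.0):
    """Toy version of the usual CTC beam-search scoring rule.

    lm_weight          -> "importance of language model"
    word_insert_weight -> "insertion of words in language model"
    (Names and defaults are illustrative, not DeepSpeech's real flags.)
    """
    return acoustic_logprob + lm_weight * lm_logprob + word_insert_weight * word_count

# Raising lm_weight makes the decoder trust the LM more relative to the acoustics:
low = combined_score(-10.0, -5.0, 3, lm_weight=0.5)   # LM barely matters
high = combined_score(-10.0, -5.0, 3, lm_weight=2.0)  # LM dominates
```

Tuning these trades off the acoustic evidence against the LM's preferences, which is why adjusting them changed the output so much here.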
Hi. Read the first post (from francob), or the tutorial!!
It will create a file named “trie”.
You'll have to pass this file, and the others, to do inference (transform sounds into words).
(Please search the forum a bit before asking. Thanks, Phanthanhlong7695)
@francob What are the hyperparameters “insertion of words in language model” and “importance of language model”, actually? Where are they located? In which file? How do I modify them?