Creation of language model and trie

francob · November 30, 2017, 6:26pm

Hi, I was running the Python package for this project and I tried to create a new language model and a trie.

For the language model, I used kenlm’s /bin/lmplz -o 2 < text > text.arpa and then /bin/build_binary text.arpa text.binary. Note that the text I fed into the language model consists of 576 sentences, each containing a single word. Then I generated the trie. To do so, I took the vocabulary used by the language model (576 words) and wrote it to a text file named 576words.txt, one word per line. I then generated the trie with /util/generate_trie alphabet.txt text.binary 576words.txt trie.
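A minimal sketch of how a one-word-per-line vocabulary file like 576words.txt could be derived from the same text given to lmplz. The function names are hypothetical, and lowercasing the words is an assumption, not something the post states.

```python
def build_vocabulary(sentences):
    """Return the sorted unique words found in an iterable of sentences."""
    words = set()
    for sentence in sentences:
        # Assumption: words are lowercased before being collected.
        words.update(sentence.lower().split())
    return sorted(words)


def write_vocabulary(words, path):
    """Write one word per line, the layout described for 576words.txt."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(words) + "\n")


# Toy corpus standing in for the 576 one-word sentences.
corpus = ["balloon", "mailbox", "flower", "Balloon"]
vocabulary = build_vocabulary(corpus)
print(vocabulary)
```

Calling write_vocabulary(vocabulary, "576words.txt") would then produce the file passed to generate_trie.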

I then ran the Python package, and here are my two results:

Original:
be lon f malbax flowr daughter blueprin trackter scoubol a glue canna mustard salt pepper head baned shoveel palace leanax basement it bl

New:
belonfmalbax flower daughter blueprint trackter scoubol aglue annal mustard salt pepper head baned shovel palace leanax basement it bl

However, I want the output to be:

balloon mailbox flower daughter blueprint tractor scalpel igloo canal mustard salt pepper headband shovel palace Kleenex basement igloo

It looks like it would decode it this way if the language model were working. Please advise whether I am building the language model correctly, or whether there is any other way to improve this result. Is there any way to force the output to stay within the language model?

Thanks so much for creating such a great project!

elpimous_robot · November 30, 2017, 9:19pm

Well, it seems correct!
But it seems that your model has a bad WER.
For your binary, you must send kenlm a text file containing your sentences (not one word at a time):
complete sentences, so it can learn word placement probabilities.
Your vocab is wrong (it is not like CMUSphinx!).

francob · November 30, 2017, 9:22pm

Thanks for the help! For the vocab (when I generate the trie), what should I be using? Should I use the 576words.txt or should I be using some other thing?

elpimous_robot · November 30, 2017, 9:26pm

Which model do you use? The one provided by Mozilla, or the one you created yourself?

francob · November 30, 2017, 9:27pm

I am using the language model I generated myself and Mozilla’s output_graph.pb.

elpimous_robot · November 30, 2017, 9:41pm

Their output_graph.pb is their model, trained on millions of sentences, so it allows a lot of possibilities.
I think you made an error in your vocab:
(in CMUSphinx, your vocab contained one word per line, with phonemes, for the LM that follows)
I suggest you modify your vocab so that it contains one complete sentence (the complete wav content) per line,
and use it to create your LM file (as you did, correctly!).
That way, you’ll have an lm.binary optimized for your sentences (with good word positions within the phrase).
How could your LM predict a good word position in a sentence if it has not learnt it before?
Test it, and tell us if it works.
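To illustrate the point about word positions, here is a small sketch that counts bigrams the way an order-2 model sees them. It is a toy counter with hypothetical names, not kenlm itself. A corpus of one-word sentences yields no word-to-word pairs at all, only pairs involving the sentence boundary markers.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent token pairs, padding each sentence with <s> and </s>."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts


def word_to_word(counts):
    """Keep only the pairs where both sides are real words."""
    markers = {"<s>", "</s>"}
    return {pair: n for pair, n in counts.items()
            if pair[0] not in markers and pair[1] not in markers}


single_words = ["balloon", "mailbox", "flower"]   # one word per sentence
full_sentences = ["balloon mailbox flower"]       # one complete sentence

print(word_to_word(bigram_counts(single_words)))    # no word-to-word pairs
print(word_to_word(bigram_counts(full_sentences)))  # two word-to-word pairs
```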

francob · November 30, 2017, 9:46pm

Am I building the trie in the correct way? Should I be using only the 576 words for the vocab?

elpimous_robot · November 30, 2017, 9:50pm

Yes!
It’s what I did!

For my robot, I needed a small corpus, for orders:
I created my own wavs and text,
used my original text file as the vocab (as is), and for LM creation too,
so both vocab and LM come from the same vocab file (one complete sentence per line).

And if you want, later, to create your own pb model, tell me.

elpimous_robot · November 30, 2017, 9:53pm

Here, you’ll have a big model, with a small vocab and a small LM.

(If Mozilla provides an LM with their pb model, perhaps you could try to use it, but keep your vocab.)

francob · December 1, 2017, 6:50pm

I managed to solve the problem by doing two things: creating a new language model that included every possible pair of words, and then adjusting the hyperparameters, specifically the “insertion of words in language model” and “importance of language model”. Thanks so much for your help!
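A rough sketch of what a corpus with every possible pair of words might look like, assuming ordered pairs that include a word followed by itself (the post does not say which pairs were used). The function name is hypothetical.

```python
from itertools import product


def all_word_pairs(vocabulary):
    """Return one two-word sentence for every ordered pair of words."""
    return [f"{first} {second}"
            for first, second in product(vocabulary, repeat=2)]


# Toy vocabulary standing in for the 576 words.
vocabulary = ["balloon", "mailbox", "flower"]
pairs = all_word_pairs(vocabulary)
print(len(pairs))
print(pairs[:3])
```

Under this assumption, a 576-word vocabulary gives 576 × 576 = 331,776 lines to feed to lmplz.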

elpimous_robot · December 1, 2017, 7:30pm

You’re welcome, Francob.

Arianna_m · December 7, 2017, 2:04am

When I tried to create my trie, I kept getting this error. Do you have any idea what it might be coming from?

./generate_trie …/models/alphabet.txt …/models/lm.binary …/models/vocab.txt trie
Invalid label A
Aborted (core dumped)

elpimous_robot · December 10, 2017, 12:09pm

Hi.
Do you have uppercase letters in your alphabet?
If so, convert everything to lowercase and try again.
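One way to hunt for the offending characters before running generate_trie is to compare every character in the vocabulary with the alphabet. This is only a sketch: it assumes the alphabet file lists one symbol per line and that lines starting with '#' are comments, and it does not reproduce the actual check inside generate_trie. The function names are hypothetical.

```python
def load_alphabet(lines):
    """Collect symbols, one per line, skipping lines that start with '#'."""
    symbols = set()
    for line in lines:
        symbol = line.rstrip("\n")
        # Assumption: '#' lines are comments; a line holding a single
        # space is kept as the space symbol.
        if symbol and not symbol.startswith("#"):
            symbols.add(symbol)
    return symbols


def invalid_labels(words, alphabet):
    """Return the sorted characters used in words but absent from alphabet."""
    return sorted(set("".join(words)) - alphabet)


# Toy data standing in for alphabet.txt and vocab.txt.
alphabet_lines = ["# comment\n", " \n", "a\n", "b\n", "c\n"]
words = ["cab", "Abba"]

alphabet = load_alphabet(alphabet_lines)
print(invalid_labels(words, alphabet))
print(invalid_labels([w.lower() for w in words], alphabet))
```

With this toy data, the uppercase "A" is reported until the words are lowercased.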

phanthanhlong7695 · January 26, 2018, 8:23am

What can I do to create the trie?

elpimous_robot · January 26, 2018, 11:57am

Hi. Read the first post (from francob), or the tutorial!

It will create a file named “trie”.

You’ll have to pass this file, along with the others, to run inference (transform sounds into words).
(Please search the forum a bit before asking. Thanks, Phanthanhlong7695.)

phanthanhlong7695 · January 27, 2018, 4:21am

Did you fix that?
Can you show me how?

deepakgupta1313 · June 18, 2018, 4:55pm

@francob How did you fix the hyperparameters “insertion of words in language model” and “importance of language model”?

deepakgupta1313 · June 18, 2018, 4:58pm

@francob What exactly are the hyperparameters “insertion of words in language model” and “importance of language model”? Where are they located? In which file? How do I modify them?
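For context, decoders that combine an acoustic model with a language model often score a candidate transcript as a weighted sum, with one weight for the language model and one bonus per inserted word. The sketch below is a generic illustration of that idea, not this project's code. The names lm_weight and word_bonus are hypothetical, and the log-probabilities are made-up numbers.

```python
def combined_score(acoustic_logprob, lm_logprob, word_count,
                   lm_weight, word_bonus):
    """Weighted sum of acoustic score, LM score, and a per-word bonus."""
    return acoustic_logprob + lm_weight * lm_logprob + word_bonus * word_count


# Two made-up candidates for the same audio: fragments vs. a real word.
fragments = {"acoustic": -4.0, "lm": -12.0, "words": 2}   # "be lon"
real_word = {"acoustic": -5.0, "lm": -3.0, "words": 1}    # "balloon"


def score(candidate, lm_weight, word_bonus):
    return combined_score(candidate["acoustic"], candidate["lm"],
                          candidate["words"], lm_weight, word_bonus)


# With the language model ignored, the acoustically closer fragments win.
print(score(fragments, 0.0, 0.0) > score(real_word, 0.0, 0.0))
# With the language model weighted in, the real word wins.
print(score(real_word, 1.0, 0.0) > score(fragments, 1.0, 0.0))
```

Raising the language model weight pushes the output toward word sequences the language model has seen, while the per-word bonus controls how readily extra words are inserted.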

allen23777 · April 28, 2019, 8:13am

The current version of generate_trie looks like it no longer needs the vocabulary as an input parameter. Is that expected?

std::cerr << "Usage: " << argv[0] << " <lm_model> <trie_path>" << std::endl;

The trie I built is only 9 bytes. Can someone help me understand what might be the issue? Thanks

lissyx · April 29, 2019, 7:10am

Please share more details; we can’t divine what you are doing.