Creation of language model and trie

(Francob) #1

Hi, I was running the Python package for this project and I tried to create a new language model and a trie.

For the language model, I used kenlm’s /bin/lmplz -o 2 < text > and then /bin/build_binary text.binary. Note that the text I fed into the language model consists of 576 sentences, each with one word. Then I generated the trie. To do so, I took the vocabulary used by the language model (576 words) and created a text file of those words named 576words.txt, with one word per line. I then generated the trie with /util/generate_trie alphabet.txt text.binary 576words.txt trie.
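As a rough illustration of the vocabulary step described above, here is a minimal Python sketch that collects the unique words of a corpus and writes them one per line. The function names and the example corpus are hypothetical; only the file name 576words.txt comes from this post.

```python
def extract_vocabulary(corpus_lines):
    """Return the unique words of a corpus, in first-seen order."""
    seen = set()
    vocabulary = []
    for line in corpus_lines:
        for word in line.split():
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


def write_word_list(words, path):
    """Write one word per line, the layout described for 576words.txt."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(words) + "\n")


# Hypothetical three-line corpus; blank lines are skipped naturally.
example_corpus = ["salt pepper", "pepper mustard", ""]
example_vocabulary = extract_vocabulary(example_corpus)
```

Calling write_word_list(example_vocabulary, "576words.txt") would then produce the word list handed to generate_trie.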

I then ran the Python package, and here are my two results:

be lon f malbax flowr daughter blueprin trackter scoubol a glue canna mustard salt pepper head baned shoveel palace leanax basement it bl

belonfmalbax flower daughter blueprint trackter scoubol aglue annal mustard salt pepper head baned shovel palace leanax basement it bl

However, I want the output to be:

balloon mailbox flower daughter blueprint tractor scalpel igloo canal mustard salt pepper headband shovel palace Kleenex basement igloo

It looks like it would decode it this way if the language model were working. Please advise whether I am building the language model correctly, or whether there is any other way to improve this result. Is there any way to force the output to stay within the language model?

Thanks so much for creating such a great project!

(Vincent Foucault) #2

Well, it seems correct!
But it seems that your model has a bad WER (word error rate).
For your binary, you must send KenLM a text file containing your sentences, not one word per line
(complete sentences, so that it can learn word-placement probabilities).
Your vocab is wrong (it is not like in CMUSphinx!).

(Francob) #3

Thanks for the help! For the vocab (when I generate the trie), what should I be using? Should I use 576words.txt, or something else?

(Vincent Foucault) #4

Which model do you use? The one provided by Mozilla, or one you created yourself?

(Francob) #5

I am using the language model I generated myself and Mozilla’s output_graph.pb.

(Vincent Foucault) #6

Their output_graph.pb is their model, trained on millions of sentences, so it allows a lot of possibilities.
I think you made an error in your vocab:
(in CMUSphinx, your vocab contained one word per line, with phonemes, for the LM that follows).
I suggest you modify your vocab so that it contains one complete sentence (the complete content of a WAV) per line,
and use it to create your LM file (as you did, correctly!).
That way, you’ll have an lm.binary optimized for your sentences (with good word positions within a phrase).
How could your LM predict a good word position in a sentence if it has not learned it before?
Test it, and tell us if it works.

(Francob) #7

Am I building the trie in the correct way? Should I be using only the 576 words for the vocab?

(Vincent Foucault) #8

Yes!
That’s what I did!

For my robot, I needed a small corpus of commands:
I created my own WAVs and text.
I used my original text file as the vocab (as is), and for the LM creation too.
So both the vocab and the LM come from the same file (one complete sentence per line).

And if you want to create your own pb model later, tell me.

(Vincent Foucault) #9

Here, you’ll have a big model with a small vocab and a small LM.

(If Mozilla provides an LM with their pb model, perhaps you could try using it, but keep your vocab.)

(Francob) #10

I managed to solve the problem by doing two things: creating a new language model that included every possible pair of words, and then fixing the hyperparameters, specifically the “insertion of words in language model” and “importance of language model”. Thanks so much for your help!
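One way such a corpus could be produced is sketched below in Python. It is only a guess at the approach: the post does not say whether pairs were ordered or whether a word paired with itself was included, so this sketch assumes ordered pairs with repeats. The function name is hypothetical.

```python
from itertools import product


def pair_sentences(words):
    """Yield every ordered pair of words (repeats included) as a two-word sentence."""
    for first, second in product(words, repeat=2):
        yield f"{first} {second}"


# Hypothetical two-word vocabulary to show the output shape.
example_pairs = list(pair_sentences(["salt", "pepper"]))
```

Under these assumptions a 576-word vocabulary gives 576 × 576 = 331,776 lines, which can be written to a text file and passed to lmplz in place of the one-word-per-sentence corpus.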

(Vincent Foucault) #11

You’re welcome, Francob.

(Arianna) #12

When I tried to create my trie, I kept getting this error. Do you have any idea what it might be coming from?

./generate_trie …/models/alphabet.txt …/models/lm.binary …/models/vocab.txt trie
Invalid label A
Aborted (core dumped)

(Vincent Foucault) #13

Do you have uppercase letters in your alphabet?
If so, convert everything to lowercase and try again.
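The “Invalid label A” message suggests, though this thread does not confirm it, that one of the inputs contains a character that is not listed in alphabet.txt. A small Python sketch like the one below can locate such characters in a vocab file before running generate_trie. The function name and the example alphabet are assumptions for illustration, not the contents of the real alphabet.txt.

```python
def find_invalid_labels(vocab_lines, alphabet):
    """Map each character missing from the alphabet to the first line number that uses it."""
    invalid = {}
    for line_number, line in enumerate(vocab_lines, start=1):
        for char in line.rstrip("\n"):
            if char not in alphabet and char not in invalid:
                invalid[char] = line_number
    return invalid


# Assumed alphabet: lowercase letters, apostrophe, and space.
example_alphabet = set("abcdefghijklmnopqrstuvwxyz' ")
example_vocab = ["salt pepper", "Apple"]

problems = find_invalid_labels(example_vocab, example_alphabet)
# Lowercasing every line is the fix suggested above.
after_lowercasing = find_invalid_labels(
    [line.lower() for line in example_vocab], example_alphabet
)
```

Here problems points at the uppercase “A” on line 2, and after_lowercasing is empty.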

(Phanthanhlong7695) #14

What can I do to create the trie?

(Vincent Foucault) #15

Hi. Read the first post (from Francob), or the tutorial!

The command there will create a file named “trie”.

You’ll have to pass this file, along with the others, when running inference (turning sound into words).
(Please search the forum a bit before asking. Thanks, Phanthanhlong7695.)

(Phanthanhlong7695) #16

Did you fix that?
Can you show me how?

(Deepak Gupta) #17

@francob How did you fix the hyperparameters “insertion of words in language model” and “importance of language model”?

(Deepak Gupta) #18

@francob What exactly are the hyperparameters “insertion of words in language model” and “importance of language model”? Where are they located, and in which file? How do I modify them?
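For orientation, beam search decoders that use a language model commonly rank each candidate transcript by its acoustic score, plus a weighted language model score, plus a bonus per word. The two settings named above plausibly correspond to that weight and that bonus. The Python sketch below is a generic illustration of this scoring, not DeepSpeech’s actual code; every name and number in it is hypothetical.

```python
def combined_score(acoustic_logprob, lm_logprob, word_count,
                   lm_weight, word_insertion_bonus):
    """Score a candidate: acoustic + lm_weight * LM + bonus * number of words."""
    return (acoustic_logprob
            + lm_weight * lm_logprob
            + word_insertion_bonus * word_count)


def best_candidate(candidates, lm_weight, word_insertion_bonus):
    """Return the text of the highest-scoring candidate."""
    def score(candidate):
        text, acoustic_logprob, lm_logprob = candidate
        return combined_score(acoustic_logprob, lm_logprob, len(text.split()),
                              lm_weight, word_insertion_bonus)
    return max(candidates, key=score)[0]


# Hypothetical (text, acoustic log-prob, LM log-prob) triples.
candidates = [
    ("a glue", -4.0, -9.0),  # slightly better acoustically, unlikely under the LM
    ("igloo", -5.0, -3.0),   # slightly worse acoustically, likely under the LM
]

weak_lm_choice = best_candidate(candidates, lm_weight=0.1, word_insertion_bonus=1.0)
strong_lm_choice = best_candidate(candidates, lm_weight=1.0, word_insertion_bonus=0.5)
```

With these made-up numbers, a low language model weight and a high per-word bonus favor the split output “a glue”, while a higher weight and a lower bonus favor “igloo”. That is the kind of shift Francob describes after tuning.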