Creation of language model and trie

Which model do you use? The one provided by Mozilla, or the one you created yourself?

I am using the language model I generated myself, together with Mozilla's output_graph.pb.

Their output_graph.pb is their acoustic model, trained on millions of sentences, so it covers a lot of possibilities.
I think you made an error in your vocab:
(In CMU Sphinx, your vocab contained one word per line, with phonemes, for the subsequent LM.)
I suggest you modify your vocab so that it contains one complete sentence (the full content of one wav) per line,
and use it to create your LM file (as you did, correctly!).
That way you'll have an lm.binary optimized for your sentences (with good word positioning within a sentence).
How could your LM predict a good word position in a sentence if it never learnt it beforehand?
Test it, and tell us if it works.
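
For example, with the KenLM tools that lm.binary is built with (a minimal sketch; I'm assuming lmplz and build_binary are on your PATH, and --discount_fallback is often needed for a very small corpus):

# vocab.txt contains one complete sentence (the full transcript of one wav) per line
lmplz --order 3 --discount_fallback < vocab.txt > words.arpa
build_binary words.arpa lm.binary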

Am I building the trie in the correct way? Should I be using only the 576 words for the vocab?

Yes!
It's what I did!

For my robot, I needed a small corpus, for commands:
I created my own wavs and their text.
My original text file was used as the vocab (as is), and for the LM creation too.
So both the vocab and the LM come from the same vocab file (one complete sentence per line).
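
For example, such a vocab.txt for a small command corpus could look like this (hypothetical sentences; each line is the full transcript of one wav):

turn on the light
move forward two meters
stop and go back to the charger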

And if you want to create your own pb model later, tell me.

Here, you'll have a big model with a small vocab and a small LM.

(If Mozilla provides an LM with their pb model, perhaps you could try to use it, but keep your vocab.)

I managed to solve the problem by doing two things: creating a new language model which included every possible pair of words, and then tuning the hyperparameters, specifically the "insertion of words in language model" and the "importance of language model". Thanks so much for your help!
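
(For anyone searching later: as far as I can tell, in the DeepSpeech.py of that era these decoder knobs were ordinary TensorFlow command-line flags. The exact names below are an assumption, so check the flag definitions in your own checkout, for example with:

# Locate the decoder-weight flags in your checkout (names vary across versions):
grep -n "DEFINE_float" DeepSpeech.py | grep -Ei "lm|word_count"
# Then override them when launching training, e.g. (flag names assumed):
# python -u DeepSpeech.py ... --lm_weight 1.75 --word_count_weight 1.00
)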

you’re welcome, Francob

When I tried to create my trie, I kept getting this error. Do you have any idea where it might be coming from?

./generate_trie ../models/alphabet.txt ../models/lm.binary ../models/vocab.txt trie
Invalid label A
Aborted (core dumped)

Hi.
Do you have uppercase letters in your alphabet?
If yes, convert all to lowercase and try again.
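
For example, for a plain ASCII alphabet (a sketch; for accented or non-Latin characters you would need a Unicode-aware tool instead of tr):

tr '[:upper:]' '[:lower:]' < alphabet.txt > alphabet_lower.txt
tr '[:upper:]' '[:lower:]' < vocab.txt > vocab_lower.txt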

What can I do to create the trie?

Hi. Read the first post (from francob), or the tutorial!

It will create a file named "trie".

You'll have to pass this file, along with the others, to do inference (turning sound into words).
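
For example, with the deepspeech client of that era the call looks roughly like this (a sketch; the argument style changed between versions, so check deepspeech --help):

deepspeech --model output_graph.pb --alphabet alphabet.txt --lm lm.binary --trie trie --audio my_order.wav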
(Please search the forum a bit before asking. Thanks, Phanthanhlong7695.)

Did you fix that?
Can you show me how?

@francob How did you tune the hyperparameters "insertion of words in language model" and "importance of language model"?

@francob What exactly are the hyperparameters "insertion of words in language model" and "importance of language model"? Where are they located? In which file? How do I modify them?

The current version of generate_trie apparently does not need the vocabulary as an input parameter any more. Is that expected?

std::cerr << "Usage: " << argv[0] << " <lm_model> <trie_path>" << std::endl;

The trie I built is only 9 bytes. Can someone help me understand what the issue might be? Thanks.

Please share more details; we can't do divination on what you are doing.

Thank you lissyx for the quick response.

I am working on a prototype which will need an ASR function for Mandarin.
I am trying to train a model based on DeepSpeech and this dataset.

yuwu has done some awesome work providing the training materials for the above dataset in the format DeepSpeech needs:
http://blog.yuwu.me/wp-content/uploads/2018/07/thchs30-csv.tar.gz

I am reusing these materials (alphabet.txt, vocabulary.txt, words.arpa, lm.binary and the trie) to train the model for some quick testing.

I was able to train the model and reduce the loss to less than 50 using the latest master branch of DeepSpeech. But when training finishes and the test phase starts, it throws the following exception:

Error: Can't parse trie file, invalid header. Try updating your trie file.

I guess the trie from yuwu's package may be out of date, so I built generate_trie by following https://github.com/mozilla/DeepSpeech/blob/master/native_client/README.md
I then used the generate_trie command to generate a new trie from yuwu's alphabet.txt and lm.binary, but the newly generated trie is only 9 bytes and I don't know what is wrong. Maybe the lm.binary is also out of date and I need to regenerate it as well, but I have not given that a try yet.

I am wondering if you can give me some advice on whether that is the right direction before I try to regenerate the lm.binary.
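
Roughly what I plan to try (a sketch, using the KenLM tools and the generate_trie I built above; the exact generate_trie arguments depend on the DeepSpeech version, so check its usage message first):

# Rebuild the ARPA/binary LM from yuwu's vocabulary with the local KenLM,
# then rebuild the trie from that fresh lm.binary
# (add --discount_fallback if lmplz complains about too little data)
lmplz --order 3 < vocabulary.txt > words.arpa
build_binary words.arpa lm.binary
./generate_trie alphabet.txt lm.binary trie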

Thanks

You might be interested in the AISHELL Mandarin dataset: http://www.openslr.org/33/

I just landed an importer for it: https://github.com/mozilla/DeepSpeech/blob/master/bin/import_aishell.py

Thanks Reuben for sharing this!

Do you have scripts to generate the alphabet.txt, vocabulary.txt and words.arpa for the dataset you linked?