Creation of language model and trie

Which model do you use? The one provided by Mozilla, or the one you created yourself?

I am using the language model I generated myself, together with Mozilla's output_graph.pb.

Their output_graph.pb is their acoustic model, trained on millions of sentences, so it covers a lot of possibilities.
I think you made an error in your vocab:
(In CMU Sphinx, your vocab contained one word per line, with phonemes, for the subsequent LM.)
I suggest you modify your vocab so that it contains one complete sentence (the full content of one wav) per line,
and use it to create your LM file (as you did, correctly!).
That way you'll have an lm.binary optimized for your sentences (with good word positioning within a sentence).
How could your LM predict a good word position in a sentence if it never learnt it beforehand?
Test it, and tell us if it works.
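
For example, with the KenLM tools that lm.binary is built with (a minimal sketch; I'm assuming lmplz and build_binary are on your PATH, and --discount_fallback is often needed for a very small corpus):

# vocab.txt contains one complete sentence (the full transcript of one wav) per line
lmplz --order 3 --discount_fallback < vocab.txt > words.arpa
build_binary words.arpa lm.binary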

Am I building the trie in the correct way? Should I be using only the 576 words for the vocab?

Yes!
It's what I did!

For my robot, I needed a small corpus, for commands:
I created my own wavs and their text.
My original text file was used as the vocab (as is), and for the LM creation too.
So both the vocab and the LM come from the same vocab file (one complete sentence per line).
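
For example, such a vocab.txt for a small command corpus could look like this (hypothetical sentences; each line is the full transcript of one wav):

turn on the light
move forward two meters
stop and go back to the charger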

And if you want to create your own pb model later, tell me.

Here, you'll have a big model with a small vocab and a small LM.

(If Mozilla provides an LM with their pb model, perhaps you could try to use it, but keep your vocab.)

I managed to solve the problem by doing two things: creating a new language model which included every possible pair of words, and then tuning the hyperparameters, specifically the "insertion of words in language model" and the "importance of language model". Thanks so much for your help!
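
(For anyone searching later: as far as I can tell, in the DeepSpeech.py of that era these decoder knobs were ordinary TensorFlow command-line flags. The exact names below are an assumption, so check the flag definitions in your own checkout, for example with:

# Locate the decoder-weight flags in your checkout (names vary across versions):
grep -n "DEFINE_float" DeepSpeech.py | grep -Ei "lm|word_count"
# Then override them when launching training, e.g. (flag names assumed):
# python -u DeepSpeech.py ... --lm_weight 1.75 --word_count_weight 1.00
)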

you’re welcome, Francob

When I tried to create my trie, I kept getting this error. Do you have any idea where it might be coming from?

./generate_trie ../models/alphabet.txt ../models/lm.binary ../models/vocab.txt trie
Invalid label A
Aborted (core dumped)

Hi.
Do you have uppercase letters in your alphabet?
If yes, convert all to lowercase and try again.
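
For example, for a plain ASCII alphabet (a sketch; for accented or non-Latin characters you would need a Unicode-aware tool instead of tr):

tr '[:upper:]' '[:lower:]' < alphabet.txt > alphabet_lower.txt
tr '[:upper:]' '[:lower:]' < vocab.txt > vocab_lower.txt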

What can I do to create the trie?

Hi. Read the first post (from francob), or the tutorial!

It will create a file named "trie".

You'll have to pass this file, along with the others, to do inference (turning sound into words).
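
For example, with the deepspeech client of that era the call looks roughly like this (a sketch; the argument style changed between versions, so check deepspeech --help):

deepspeech --model output_graph.pb --alphabet alphabet.txt --lm lm.binary --trie trie --audio my_order.wav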
(Please search the forum a bit before asking. Thanks, Phanthanhlong7695.)

Did you fix that?
Can you show me how?

@francob How did you tune the hyperparameters "insertion of words in language model" and "importance of language model"?

@francob What exactly are the hyperparameters "insertion of words in language model" and "importance of language model"? Where are they located? In which file? How do I modify them?

The current version of generate_trie apparently does not need the vocabulary as an input parameter any more. Is that expected?

std::cerr << "Usage: " << argv[0] << " <lm_model> <trie_path>" << std::endl;

The trie I built is only 9 bytes. Can someone help me understand what the issue might be? Thanks.

Please share more details; we can't do divination on what you are doing.

Thank you lissyx for the quick response.

I am working on a prototype which will need an ASR function for Mandarin.
I am trying to train a model based on DeepSpeech and this dataset.

yuwu has done some awesome work providing the training materials for the above dataset in the format DeepSpeech needs:
http://blog.yuwu.me/wp-content/uploads/2018/07/thchs30-csv.tar.gz

I am reusing these materials (alphabet.txt, vocabulary.txt, words.arpa, lm.binary and the trie) to train the model for some quick testing.

I was able to train the model and reduce the loss to less than 50 using the latest master branch of DeepSpeech. But when training finishes and the test phase starts, it throws the following exception:

Error: Can't parse trie file, invalid header. Try updating your trie file.

I guess the trie from yuwu's package may be out of date, so I built generate_trie by following https://github.com/mozilla/DeepSpeech/blob/master/native_client/README.md
I then used the generate_trie command to generate a new trie from yuwu's alphabet.txt and lm.binary, but the newly generated trie is only 9 bytes and I don't know what is wrong. Maybe the lm.binary is also out of date and I need to regenerate it as well, but I have not given that a try yet.

I am wondering if you can give me some advice on whether that is the right direction before I try to regenerate the lm.binary.
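
Roughly what I plan to try (a sketch, using the KenLM tools and the generate_trie I built above; the exact generate_trie arguments depend on the DeepSpeech version, so check its usage message first):

# Rebuild the ARPA/binary LM from yuwu's vocabulary with the local KenLM,
# then rebuild the trie from that fresh lm.binary
# (add --discount_fallback if lmplz complains about too little data)
lmplz --order 3 < vocabulary.txt > words.arpa
build_binary words.arpa lm.binary
./generate_trie alphabet.txt lm.binary trie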

Thanks

You might be interested in the AISHELL Mandarin dataset: http://www.openslr.org/33/

I just landed an importer for it: https://github.com/mozilla/DeepSpeech/blob/master/bin/import_aishell.py

Thanks Reuben for sharing this!

Do you have scripts to generate the alphabet.txt, vocabulary.txt and words.arpa for the dataset you linked?