Generating trie

Are you looking at 0.6.1 release files?

Yes, I am looking at the 0.6.1 release.

Then look at the files for the 0.6.1 release; it’s there.

There are prebuilt versions of the native client, if you cloned the repo, you can run

python3 util/taskcluster.py --target native_client

and it should download it for you. Afterwards you’ll have a generate_trie binary.

Thanks @lissyx, generated the trie for my custom LM.

Hello! I ran python3 util/taskcluster.py --target native_client and generate_trie is still not there :frowning: I’ve been stuck at this step for days… any ideas? I use the 0.7.0-alpha.2 version.

I know this can be confusing, so keep asking; you are giving the right information :slight_smile: The trie was replaced by the scorer for the 0.7 release. Build the lm.arpa and lm.binary as before, then run generate_package.py with the binary, the txt vocabulary file, and the alphabet as inputs.
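Put together, the 0.7 flow might look roughly like this; all file paths and the alpha/beta values below are placeholders, adjust them to your setup:

```shell
# Sketch of the 0.7 scorer pipeline; paths are assumptions for illustration.
# 1. Build the ARPA language model from your vocabulary text (KenLM).
lmplz --order 5 --temp_prefix tmp --memory 50% --text vocab.txt --arpa lm.arpa

# 2. Convert it to KenLM's binary trie format.
build_binary -a 255 -q 8 -v trie lm.arpa lm.binary

# 3. Package binary LM + vocabulary + alphabet into a scorer file.
python3 generate_package.py --alphabet alphabet.txt --lm lm.binary \
  --vocab vocab.txt --default_alpha 0.75 --default_beta 1.85 \
  --package kenlm.scorer
```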

Thanks for fast replying! I already tried to run generate_package.py, as mentioned in the readme but it gave me the following error:
Traceback (most recent call last):
File "generate_package.py", line 15, in
from ds_ctcdecoder import Scorer, Alphabet as NativeAlphabet
ImportError: No module named ds_ctcdecoder

Ah, you have to get the native_client and the decoder, try:

python3 util/taskcluster.py --decoder

to download it.

Download the native client from https://github.com/mozilla/DeepSpeech/releases . You will find generate_trie once you untar it. Then pass alphabet.txt, lm.binary and the path to save the trie as arguments to generate_trie.
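If I recall correctly, on 0.6.x the call takes three positional arguments; the file names here are placeholders:

```shell
# 0.6.x only: alphabet, binary language model, and output path for the trie.
./generate_trie alphabet.txt lm.binary trie
```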

Where should this decoder be located? I placed it in DeepSpeech/data/lm, now I have a brand new error haha! 4860 unique words read from vocabulary file.
Doesn’t look like a character based model.
Error: Can’t parse scorer file, invalid header. Try updating your scorer file.
Package created in kenlm.scorer

Sorry, so you run this in the DeepSpeech main folder:

pip install $(python3 util/taskcluster.py --decoder)

It downloads the wheel and installs it into the virtualenv you are running.

Thank you a lot! Sorry for asking so many things, but it’s my first time working with something this big. I’m working on my bachelor’s degree, so I’m pretty much a noob :smiley:.

Hello! I am still stuck at this error: 4860 unique words read from vocabulary file.
Doesn’t look like a character based model.
Error: Can’t parse scorer file, invalid header. Try updating your scorer file.
Package created in kenlm.scorer
This appears when I execute this:

`python generate_package.py --alphabet path/alphabet.txt --lm path/lm.binary --vocab path/vocab.txt --default_alpha 0.75 --default_beta 1.85 --package kenlm.scorer`

I trained a Chinese model with v0.7.0-alpha.3 and also faced this problem :rofl:

root@0cb4d86eab66:/Other_version/DeepSpeech/data/lm# python generate_package.py --lm /DeepSpeech/data/lm/lm.binary --vocab /DeepSpeech/data/all/alphabet.txt --package lm.scorer --default_alpha 0.75 --default_beta 1.85

6557 unique words read from vocabulary file.
Looks like a character based model.
Error: Can’t parse scorer file, invalid header. Try updating your scorer file.
Package created in lm.scorer

@Andreea_Georgiana_Sarca How did you build the lm.binary file? Did you use all arguments set here? Because the command looks fine

Yea well I was not using all the arguments.
So again I tried:
/home/andreea/kenlm/build/bin/lmplz --order 5 --temp_prefix tmp --memory 50% --text vocab.txt --arpa words.arpa --prune 0 0 1
Then:
/home/andreea/kenlm/build/bin/build_binary -a 255 -q 8 -v -trie words.arpa lm.binary

And for this last command I got: Quantization is only implemented in the trie data structure.

When I ran again the generate_package.py as I mentioned before I got exactly the same thing…

The lmplz looks good. You could go with an order lower than 5 if you have just a couple of MB of text; then use 2 or 3 and don’t use pruning.
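For a small corpus, that suggestion would look something like this (paths are placeholders):

```shell
# Lower-order model for a corpus of only a few MB; no --prune flag.
lmplz --order 3 --temp_prefix tmp --memory 50% --text vocab.txt --arpa words.arpa
```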

For the build_binary, mine works without the dash at trie:

build_binary -a 255 -q 8 -v trie current/lm.arpa current/lm.binary

So, please try it without the dash.

Yep it worked without the dash (thank youu!!!):
Reading words.arpa
Identifying n-grams omitted by SRI
Quantizing
Writing trie
SUCCESS

But generate_package gives me the error :frowning:
4860 unique words read from vocabulary file.
Doesn’t look like a character based model.
Package created in kenlm.scorer

Now it doesn’t say anything about the missing header. In the vocabulary file I placed all my transcripts; they are sentences in Romanian, and we have some special characters like ă, î, ş, ţ, â. I also put them in the alphabet…
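One quick sanity check is that every character used in the vocabulary also appears in the alphabet; a minimal sketch with toy files (names and contents are just for illustration):

```shell
# Toy alphabet and vocabulary files; replace with your real ones.
printf 'a\nb\nc\n' > alphabet.txt
printf 'ab cad\n' > vocab.txt

# Print every distinct character in the vocabulary (ignoring spaces) that is
# missing from alphabet.txt; empty output means the alphabet covers everything.
grep -o . vocab.txt | sort -u | grep -v '^ $' | grep -vxF -f alphabet.txt
# → prints: d   (used in vocab.txt but absent from alphabet.txt)
```

With a UTF-8 locale the same check also handles diacritics like ă, î, ş, ţ, â.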

Don’t worry about the character model warning. If I remember correctly, that is for languages like Chinese.

4860 words is not much. You might expand this by using text from the Romanian Wikipedia or something similar.
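A rough sketch of pulling in extra Romanian text; the dump URL pattern and the third-party WikiExtractor tool are assumptions, check dumps.wikimedia.org for the current file name:

```shell
# Download the latest Romanian Wikipedia articles dump (URL is an assumption).
wget https://dumps.wikimedia.org/rowiki/latest/rowiki-latest-pages-articles.xml.bz2

# Extract plain text with the third-party WikiExtractor tool
# (pip install wikiextractor).
python3 -m wikiextractor.WikiExtractor rowiki-latest-pages-articles.xml.bz2 -o extracted

# Append the extracted text to the vocabulary file used for lmplz.
cat extracted/*/* >> vocab.txt
```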
