Generating trie

Is there an easy way to generate a trie for a custom LM rather than building it from the native client binaries?

Please download native_client.tar.xz from the releases page, it contains all you need.

I don't find generate_trie in native_client.tar.xz.
I can see the following files in native_client.tar.xz:
LICENSE README.mozilla deepspeech deepspeech.h libdeepspeech.so

Are you looking at 0.6.1 release files?

Yes, I am looking at the 0.6.1 release.

Then look at the files for the 0.6.1 release, it's there.

There are prebuilt versions of the native client. If you cloned the repo, you can run

python3 util/taskcluster.py --target native_client

and it should download it for you. Afterwards you'll have a generate_trie binary.


Thanks @lissyx, I generated the trie for my custom LM.


Hello! I ran python3 util/taskcluster.py --target native_client and generate_trie is still not around :frowning: I've been stuck at this step for days… any ideas? I am using the 0.7.0-alpha.2 version.

I know this can be confusing, just ask more often, you are giving the right information :slight_smile: The trie was replaced by the scorer for the 0.7 release. Build the lm.arpa and lm.binary as before, then run generate_package.py with the binary, the vocabulary txt file and the alphabet as inputs:
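A minimal sketch of that call, using the flags that appear later in this thread; the paths and the alpha/beta values here are placeholders, not recommended settings:

python generate_package.py --alphabet /path/to/alphabet.txt --lm /path/to/lm.binary --vocab /path/to/vocab.txt --default_alpha 0.75 --default_beta 1.85 --package kenlm.scorer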


Thanks for the fast reply! I already tried to run generate_package.py, as mentioned in the readme, but it gave me the following error:
Traceback (most recent call last):
File "generate_package.py", line 15, in <module>
from ds_ctcdecoder import Scorer, Alphabet as NativeAlphabet
ImportError: No module named ds_ctcdecoder

Ah, you have to get the native_client and the decoder, try:

python3 util/taskcluster.py --decoder

to download it.

Download the native client from https://github.com/mozilla/DeepSpeech/releases. You will find generate_trie once you untar it. Then pass alphabet.txt, lm.binary and the path where the trie should be saved to generate_trie, for example:
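A rough sketch of that call, assuming the 0.6.x generate_trie which takes these as positional arguments in this order; adjust the paths to your own files:

./generate_trie /path/to/alphabet.txt /path/to/lm.binary /path/to/trie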

Where should this decoder be located? I placed it in DeepSpeech/data/lm, and now I have a brand new error haha!
4860 unique words read from vocabulary file.
Doesn't look like a character based model.
Error: Can't parse scorer file, invalid header. Try updating your scorer file.
Package created in kenlm.scorer

Sorry, so you run this in the DeepSpeech main folder:

pip install $(python3 util/taskcluster.py --decoder)

It downloads the wheel and installs it into the virtualenv you are running.
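To confirm the wheel ended up in the virtualenv you are using, a quick sanity check (the import is the same one from the traceback above):

python3 -c "from ds_ctcdecoder import Scorer, Alphabet as NativeAlphabet"

If that exits without an ImportError, generate_package.py should get past that step.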


Thank you a lot! Sorry for asking so many things, but it's my first time working with something this big. I am working on my bachelor's degree, so I'm pretty much a noob :smiley:.

Hello! I am still stuck at this error:
4860 unique words read from vocabulary file.
Doesn't look like a character based model.
Error: Can't parse scorer file, invalid header. Try updating your scorer file.
Package created in kenlm.scorer
This appears when I execute this:

`python generate_package.py --alphabet path/alphabet.txt --lm path/lm.binary --vocab path/vocab.txt --default_alpha 0.75 --default_beta 1.85 --package kenlm.scorer`

I trained a Chinese model with v0.7.0-alpha3 and also faced this problem :rofl:

root@0cb4d86eab66:/Other_version/DeepSpeech/data/lm# python generate_package.py --lm /DeepSpeech/data/lm/lm.binary --vocab /DeepSpeech/data/all/alphabet.txt --package lm.scorer --default_alpha 0.75 --default_beta 1.85

6557 unique words read from vocabulary file.
Looks like a character based model.
Error: Can’t parse scorer file, invalid header. Try updating your scorer file.
Package created in lm.scorer

@Andreea_Georgiana_Sarca How did you build the lm.binary file? Did you use all the arguments set here? The command itself looks fine.

Yeah, well, I was not using all the arguments.
So I tried again:
/home/andreea/kenlm/build/bin/lmplz --order 5 --temp_prefix tmp --memory 50% --text vocab.txt --arpa words.arpa --prune 0 0 1
Then:
/home/andreea/kenlm/build/bin/build_binary -a 255 -q 8 -v -trie words.arpa lm.binary

And for this last command I got: Quantization is only implemented in the trie data structure.

When I ran generate_package.py again as mentioned before, I got exactly the same thing…
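For what it's worth, the "Quantization is only implemented in the trie data structure" message usually means build_binary did not build a trie at all: in KenLM the data structure name is a positional argument, not a -trie flag, so -q 8 was being applied to the default probing structure. A sketch of the corrected call, reusing the paths from the post above:

/home/andreea/kenlm/build/bin/build_binary -a 255 -q 8 -v trie words.arpa lm.binary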