Is there an easy way to generate a trie for a custom LM, rather than building it from the native client binaries?
Please download native_client.tar.xz from the releases page; it contains all you need.
I don't find generate_trie in native_client.tar.xz. I can see the following files in it:
LICENSE README.mozilla deepspeech deepspeech.h libdeepspeech.so
Are you looking at the 0.6.1 release files?
Yes, I am looking at the 0.6.1 release.
Then look at the files for the 0.6.1 release; it's there.
There are prebuilt versions of the native client. If you cloned the repo, you can run
python3 util/taskcluster.py --target native_client
and it should download it for you. Afterwards you'll have a generate_trie binary.
Thanks @lissyx. Generated the trie for my custom LM.
Hello! I ran python3 util/taskcluster.py --target native_client and generate_trie is still not around. I've been stuck at this step for days… any ideas? I'm using the 0.7.0-alpha.2 version.
I know this can be confusing; just ask more often, you are giving the right information. The trie was replaced by the scorer for the 0.7 release. Build the lm.arpa and lm.binary as before, then run generate_package.py with the binary, the vocabulary text file, and the alphabet as inputs:
https://github.com/mozilla/DeepSpeech/blob/master/data/lm/generate_package.py
Thanks for the fast reply! I already tried to run generate_package.py as mentioned in the README, but it gave me the following error:
Traceback (most recent call last):
  File "generate_package.py", line 15, in <module>
    from ds_ctcdecoder import Scorer, Alphabet as NativeAlphabet
ImportError: No module named ds_ctcdecoder
Ah, you have to get the native_client and the decoder. Try:
python3 util/taskcluster.py --decoder
to download it.
Download the native client from https://github.com/mozilla/DeepSpeech/releases; you will find generate_trie once you untar it. Then pass alphabet.txt, lm.binary, and the path to save the trie as arguments to generate_trie.
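For reference, a sketch of that invocation for the 0.6.x releases. The paths below are placeholders, and the arguments are positional:

```shell
# Assumption: placeholder paths. generate_trie (DeepSpeech 0.6.x) takes three
# positional arguments: the alphabet file, the KenLM binary LM, and the path
# where the trie should be written.
./generate_trie path/to/alphabet.txt path/to/lm.binary path/to/trie
```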
Which should be the location of this decoder? I placed it in DeepSpeech/data/lm, and now I have a brand new error, haha!
4860 unique words read from vocabulary file.
Doesn't look like a character based model.
Error: Can't parse scorer file, invalid header. Try updating your scorer file.
Package created in kenlm.scorer
Sorry, so you run this in the DeepSpeech main folder:
pip install $(python3 util/taskcluster.py --decoder)
It downloads the wheel and installs it into the virtualenv you are running.
Thank you a lot! Sorry for asking so many things, but it's my first time working with something this big. I'm working on my bachelor's degree, so I'm pretty much a noob.
Hello! I am still stuck at this error:
4860 unique words read from vocabulary file.
Doesn't look like a character based model.
Error: Can't parse scorer file, invalid header. Try updating your scorer file.
Package created in kenlm.scorer
This appears when I execute this:
`python generate_package.py --alphabet path/alphabet.txt --lm path/lm.binary --vocab path/vocab.txt --default_alpha 0.75 --default_beta 1.85 --package kenlm.scorer`
I trained a Chinese model with v0.7.0-alpha.3 and also faced this problem:
root@0cb4d86eab66:/Other_version/DeepSpeech/data/lm# python generate_package.py --lm /DeepSpeech/data/lm/lm.binary --vocab /DeepSpeech/data/all/alphabet.txt --package lm.scorer --default_alpha 0.75 --default_beta 1.85
6557 unique words read from vocabulary file.
Looks like a character based model.
Error: Can’t parse scorer file, invalid header. Try updating your scorer file.
Package created in lm.scorer
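One quick sanity check for the "invalid header" message, assuming the standard file signatures: a KenLM binary model starts with the magic string `mmap lm http://...`, while a text ARPA file starts with `\data\`. If the file begins with neither, it is probably not what generate_package.py expects:

```shell
# Assumption: lm.binary is a placeholder path to your language model file.
# A KenLM binary begins with "mmap lm http..."; an ARPA text file begins
# with "\data\". Anything else suggests the wrong file is being passed.
head -c 12 lm.binary
```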
@Andreea_Georgiana_Sarca How did you build the lm.binary file? Did you use all the arguments set here? The command itself looks fine.
Yeah, well, I was not using all the arguments.
So I tried again:
/home/andreea/kenlm/build/bin/lmplz --order 5 --temp_prefix tmp --memory 50% --text vocab.txt --arpa words.arpa --prune 0 0 1
Then:
/home/andreea/kenlm/build/bin/build_binary -a 255 -q 8 -v -trie words.arpa lm.binary
And for this last command I got: Quantization is only implemented in the trie data structure.
When I ran generate_package.py again as mentioned before, I got exactly the same thing…
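The quantization error most likely comes from `-trie`: in KenLM's build_binary, the data structure name (`trie`) is a positional argument, not a flag, so the `-q`/`-a` options were being applied to the default probing structure, which does not support quantization. A sketch of the corrected pair of commands, with placeholder paths:

```shell
# Assumption: placeholder paths; lmplz and build_binary are the KenLM tools.
# Build a pruned 5-gram ARPA model from the vocabulary text.
lmplz --order 5 --temp_prefix tmp --memory 50% \
      --text vocab.txt --arpa words.arpa --prune 0 0 1
# "trie" is a positional data-structure argument; -q 8 (quantization) and
# -a 255 (pointer compression) are only valid for the trie structure.
build_binary -a 255 -q 8 trie words.arpa lm.binary
```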