Generating trie

There are prebuilt versions of the native client. If you cloned the repo, you can run

python3 util/taskcluster.py --target native_client

and it should download it for you. Afterwards you’ll have a generate_trie binary.


Thanks @lissyx. Generated the trie for my custom LM.


Hello! I ran python3 util/taskcluster.py --target native_client and generate_trie is still not around :frowning: I have been stuck at this step for days… any ideas? I am using the 0.7.0-alpha.2 version.

I know this can be confusing; just keep asking, you are giving the right information :slight_smile: The trie was replaced by the scorer for the 0.7 release. Build the lm.arpa and lm.binary as before, then run generate_package.py with the binary, the txt vocabulary file, and the alphabet as inputs.
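For example, something along these lines should work, run from data/lm where the script lives (paths are placeholders; the alpha/beta values here are just the usual defaults):

python3 generate_package.py \
  --alphabet path/to/alphabet.txt \
  --lm path/to/lm.binary \
  --vocab path/to/vocab.txt \
  --default_alpha 0.75 \
  --default_beta 1.85 \
  --package kenlm.scorer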


Thanks for the fast reply! I already tried to run generate_package.py, as mentioned in the readme, but it gave me the following error:
Traceback (most recent call last):
  File "generate_package.py", line 15, in <module>
    from ds_ctcdecoder import Scorer, Alphabet as NativeAlphabet
ImportError: No module named ds_ctcdecoder

Ah, you have to get the native_client and the decoder. Try:

python3 util/taskcluster.py --decoder

to download it.

Download the native client from https://github.com/mozilla/DeepSpeech/releases . You will find generate_trie once you untar it. Then pass alphabet.txt, lm.binary, and the path to save the trie as arguments to generate_trie.
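Roughly like this, assuming you are inside the untarred native_client folder and the argument order is as I remember it (paths are placeholders):

./generate_trie path/to/alphabet.txt path/to/lm.binary path/to/trie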

Where should this decoder be located? I placed it in DeepSpeech/data/lm, and now I have a brand new error haha!
4860 unique words read from vocabulary file.
Doesn’t look like a character based model.
Error: Can’t parse scorer file, invalid header. Try updating your scorer file.
Package created in kenlm.scorer

Sorry, so you run this in the DeepSpeech main folder:

pip install $(python3 util/taskcluster.py --decoder)

It downloads the wheel and installs it into the virtualenv you are running.
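You can check it landed in the right environment afterwards with the same import the traceback complained about:

python3 -c "from ds_ctcdecoder import Scorer, Alphabet as NativeAlphabet; print('decoder ok')"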


Thank you a lot! Sorry for asking so many things, but it’s my first time working with something this big. I am working on my bachelor’s degree, so I’m pretty much a noob :smiley:.

Hello! I am still stuck at this error:
4860 unique words read from vocabulary file.
Doesn’t look like a character based model.
Error: Can’t parse scorer file, invalid header. Try updating your scorer file.
Package created in kenlm.scorer
This appears when I execute this:

`python generate_package.py --alphabet path/alphabet.txt --lm path/lm.binary --vocab path/vocab.txt --default_alpha 0.75 --default_beta 1.85 --package kenlm.scorer`

I trained a Chinese model with v0.7.0-alpha3 and also faced this problem :rofl:

root@0cb4d86eab66:/Other_version/DeepSpeech/data/lm# python generate_package.py --lm /DeepSpeech/data/lm/lm.binary --vocab /DeepSpeech/data/all/alphabet.txt --package lm.scorer --default_alpha 0.75 --default_beta 1.85

6557 unique words read from vocabulary file.
Looks like a character based model.
Error: Can’t parse scorer file, invalid header. Try updating your scorer file.
Package created in lm.scorer

@Andreea_Georgiana_Sarca How did you build the lm.binary file? Did you use all the arguments set here? Because the command itself looks fine.

Yeah, well, I was not using all the arguments.
So I tried again:
/home/andreea/kenlm/build/bin/lmplz --order 5 --temp_prefix tmp --memory 50% --text vocab.txt --arpa words.arpa --prune 0 0 1
Then:
/home/andreea/kenlm/build/bin/build_binary -a 255 -q 8 -v -trie words.arpa lm.binary

And for this last command I got: Quantization is only implemented in the trie data structure.

When I ran generate_package.py again as mentioned before, I got exactly the same thing…

The lmplz call looks good. If you have just a couple of MB of text, you could go with an order lower than 5, say 2 or 3, and skip the pruning.
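For example, your same command with just the order lowered and the --prune flag dropped:

/home/andreea/kenlm/build/bin/lmplz --order 3 --temp_prefix tmp --memory 50% --text vocab.txt --arpa words.arpa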

For build_binary, mine works without the dash in front of trie:

build_binary -a 255 -q 8 -v trie current/lm.arpa current/lm.binary

So, please try it without the dash.


Yep it worked without the dash (thank youu!!!):
Reading words.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Quantizing
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
SUCCESS

But generate_package gives me the error :frowning:
4860 unique words read from vocabulary file.
Doesn’t look like a character based model.
Package created in kenlm.scorer

Now it doesn’t say anything about the header being invalid. In the vocabulary file I placed all my transcripts; they are sentences in Romanian, and we have some special characters like ă, î, ş, ţ, â. I also placed them in the alphabet…
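For reference, the alphabet is the usual one-character-per-line file; here is a sketch rather than my exact file:

# Each line is one character the model can emit; lines starting with # are comments.
# The first entry below is a literal space.
 
a
b
c
# ... the rest of a-z, one character per line ...
ă
â
î
ş
ţ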

Don’t worry about the character model warning. If I remember correctly, that is for languages like Chinese.

4860 words is not much. You might expand this by using the Romanian Wikipedia or something like that.


Well, at the moment I am training it for a single speaker. I have multiple speakers and around 17 hours of recordings provided by my university, and I have a lot of transcripts that I will probably use. Anyway, thank you so much for your help! As I said, I am working on my final project and I will mention you and this helpful community in it, and maybe one day I will be able to help someone else who is training DeepSpeech for Romanian. :smiley: Many, many thanks!


Do you think my .sh file is correctly written?

#!/bin/sh

set -xe
if [ ! -f DeepSpeech.py ]; then
    echo "Please make sure you run this from DeepSpeech's top level directory."
    exit 1
fi;

python -u DeepSpeech.py \
  --train_files data/train/train.csv \
  --test_files data/test/test.csv \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 100 \
  --epochs 200 \
  --checkpoint_dir "$checkpoint_dir" \
  "$@"

Because it gives me a cute error:
bash: ./bin/run-andreea.sh: Permission denied

I am not an expert in bash; try to run it on the command line first, then use a script.
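In case it helps, that "Permission denied" usually just means the script is not marked executable, so either of these should get past it (path taken from your error message):

chmod +x bin/run-andreea.sh
./bin/run-andreea.sh

or simply:

bash bin/run-andreea.sh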

As for DeepSpeech, you should try to increase the train and dev batch sizes to speed up training. If you have a GPU, use train_cudnn. What to use for n_hidden varies widely, typically powers of 2, so maybe 128 or 256. I didn’t see much of a difference between such values, but I use larger inputs.

If you have the space, store more checkpoints so you can check whether an earlier checkpoint gives better results when you run for 100 epochs.
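Put together, a call along those lines might look like this (flag names as far as I remember them from the 0.7 tree; the values are only examples to adapt to your data):

python -u DeepSpeech.py \
  --train_files data/train/train.csv \
  --dev_files data/dev/dev.csv \
  --test_files data/test/test.csv \
  --train_batch_size 16 \
  --dev_batch_size 16 \
  --test_batch_size 16 \
  --n_hidden 256 \
  --epochs 100 \
  --train_cudnn \
  --max_to_keep 10 \
  --checkpoint_dir path/to/checkpoints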
