There are prebuilt versions of the native client. If you cloned the repo, you can run
python3 util/taskcluster.py --target native_client
and it should download it for you. Afterwards you'll have a generate_trie binary.
Hello! I ran python3 util/taskcluster.py --target native_client and generate_trie is still nowhere to be found. I've been stuck at this step for days… any ideas? I use the 0.7.0-alpha.2 version.
I know this can be confusing, just ask more often, you are giving the right information. The trie was replaced by the scorer for the 0.7 release. Build the lm.arpa and lm.binary as before, then run generate_package.py with the binary, the txt vocabulary file and the alphabet as inputs:
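Roughly like this, with placeholder paths (just a sketch, adjust it to your setup):
lmplz --order 5 --temp_prefix tmp --memory 50% --text vocab.txt --arpa lm.arpa
build_binary -a 255 -q 8 -v trie lm.arpa lm.binary
python generate_package.py --alphabet alphabet.txt --lm lm.binary --vocab vocab.txt --default_alpha 0.75 --default_beta 1.85 --package kenlm.scorer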
Thanks for fast replying! I already tried to run generate_package.py, as mentioned in the readme but it gave me the following error:
Traceback (most recent call last):
  File "generate_package.py", line 15, in <module>
    from ds_ctcdecoder import Scorer, Alphabet as NativeAlphabet
ImportError: No module named ds_ctcdecoder
Ah, you have to get the native_client and the decoder. Try:
python3 util/taskcluster.py --decoder
to download it.
Download the native client from https://github.com/mozilla/DeepSpeech/releases . You will find generate_trie once you untar it, then pass alphabet.txt, lm.binary and the path where the trie should be saved to generate_trie.
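If I remember right, the pre-0.7 generate_trie takes them as positional arguments, roughly like this (paths are placeholders):
./generate_trie path/to/alphabet.txt path/to/lm.binary path/to/trie
Note this only applies up to 0.6.x; on 0.7 the trie is replaced by the scorer, as mentioned above.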
Where should this decoder go? I placed it in DeepSpeech/data/lm, and now I have a brand new error haha!
4860 unique words read from vocabulary file.
Doesn’t look like a character based model.
Error: Can’t parse scorer file, invalid header. Try updating your scorer file.
Package created in kenlm.scorer
Sorry, so you run this in the DeepSpeech main folder:
pip install $(python3 util/taskcluster.py --decoder)
It downloads the wheel and installs it into the virtualenv you are running.
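A quick way to check that it landed in the right environment is to retry the import from your traceback:
python3 -c "from ds_ctcdecoder import Scorer; print('ds_ctcdecoder OK')"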
Thank you a lot! Sorry for asking so many things, but it's my first time working with something this big. I am working on my bachelor's degree, so I'm pretty much a noob.
Hello! I am still stuck at this error:
4860 unique words read from vocabulary file.
Doesn’t look like a character based model.
Error: Can’t parse scorer file, invalid header. Try updating your scorer file.
Package created in kenlm.scorer
This appears when I execute this:
`python generate_package.py --alphabet path/alphabet.txt --lm path/lm.binary --vocab path/vocab.txt --default_alpha 0.75 --default_beta 1.85 --package kenlm.scorer`
I trained a Chinese model with v0.7.0-alpha3 and also faced this problem:
root@0cb4d86eab66:/Other_version/DeepSpeech/data/lm# python generate_package.py --lm /DeepSpeech/data/lm/lm.binary --vocab /DeepSpeech/data/all/alphabet.txt --package lm.scorer --default_alpha 0.75 --default_beta 1.85
6557 unique words read from vocabulary file.
Looks like a character based model.
Error: Can’t parse scorer file, invalid header. Try updating your scorer file.
Package created in lm.scorer
@Andreea_Georgiana_Sarca How did you build the lm.binary file? Did you use all the arguments set here? Because the command looks fine.
Yeah, well, I was not using all the arguments.
So I tried again:
/home/andreea/kenlm/build/bin/lmplz --order 5 --temp_prefix tmp --memory 50% --text vocab.txt --arpa words.arpa --prune 0 0 1
Then:
/home/andreea/kenlm/build/bin/build_binary -a 255 -q 8 -v -trie words.arpa lm.binary
And for this last command I got: Quantization is only implemented in the trie data structure.
When I ran generate_package.py again, as I mentioned before, I got exactly the same thing…
The lmplz looks good. I guess you could go with an order lower than 5 if you have just a couple of MB of text; then you could use 2 or 3 and skip the pruning.
For the build_binary, mine works without the dash before trie:
build_binary -a 255 -q 8 -v trie current/lm.arpa current/lm.binary
So, please try it without the dash.
Yep it worked without the dash (thank youu!!!):
Reading words.arpa
Identifying n-grams omitted by SRI
Quantizing
Writing trie
SUCCESS
But generate_package gives me the error
4860 unique words read from vocabulary file.
Doesn’t look like a character based model.
Package created in kenlm.scorer
Now it doesn't say anything about the missing header. In the vocabulary file I placed all my transcripts; they are sentences in Romanian, and we have some special characters like ă, î, ş, ţ, â. I also placed them in the alphabet…
Don’t worry about the character model warning. If I remember correctly, that is for languages like Chinese.
4860 words is not much. You might expand this by using Romanian Wikipedia or something like that.
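As a rough sketch (file names are placeholders): once you have the extra Romanian sentences as plain text, append them to the vocabulary and rebuild everything with the same commands as before:
cat extra_romanian_sentences.txt >> vocab.txt
wc -l vocab.txt
Then rerun lmplz, build_binary and generate_package.py on the bigger vocab.txt.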
Well, at the moment I am training it for a single speaker. I have multiple speakers and around 17 hours of recordings provided by my university, and I have a lot of transcripts, so I will probably use those. Anyway, thank you so much for your help! As I said, I am working on my final project, and I will mention you and this helpful community in it. Maybe one day I will be able to help someone else who is training DeepSpeech for the Romanian language. Many, many thanks!
Do you think my .sh file is correctly written?
#!/bin/sh
set -xe
if [ ! -f DeepSpeech.py ]; then
echo "Please make sure you run this from DeepSpeech's top level directory."
exit 1
fi;
# fall back to a local checkpoints directory if checkpoint_dir is not set in the environment
checkpoint_dir=${checkpoint_dir:-checkpoints}
python -u DeepSpeech.py \
--train_files data/train/train.csv \
--test_files data/test/test.csv \
--train_batch_size 1 \
--test_batch_size 1 \
--n_hidden 100 \
--epochs 200 \
--checkpoint_dir "$checkpoint_dir" \
"$@"
Because it gives me a cute error:
bash: ./bin/run-andreea.sh: Permission denied
I am not an expert in bash; try to run it on the command line first, then use a script.
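For the "Permission denied" part, that usually just means the script is missing the execute bit, so something like this should get it going:
chmod +x bin/run-andreea.sh
./bin/run-andreea.sh
Or run it through the shell without the execute bit:
bash bin/run-andreea.sh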
As for DeepSpeech, you should try to increase the train and dev batch sizes to speed up training. If you have a GPU, use train_cudnn. What to use for n_hidden varies widely, typically powers of 2, so maybe 128 or 256. I didn't see much of a difference between such values, but I use larger inputs.
If you have the space, store more checkpoints so you can check whether an earlier checkpoint gives better results when you run for 100 epochs.
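As a rough sketch of those suggestions (flag names are from memory for 0.7, and the dev file paths, batch sizes and max_to_keep value are my assumptions, so double-check with python -u DeepSpeech.py --helpfull):
python -u DeepSpeech.py \
  --train_files data/train/train.csv \
  --dev_files data/dev/dev.csv \
  --test_files data/test/test.csv \
  --train_batch_size 16 \
  --dev_batch_size 16 \
  --test_batch_size 16 \
  --n_hidden 256 \
  --epochs 100 \
  --train_cudnn \
  --max_to_keep 10 \
  --checkpoint_dir "$checkpoint_dir" \
  "$@"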