Generating trie

The lmplz command looks good. I guess you could go with an order lower than 5 if you only have a couple of MB of text; in that case use 2 or 3 and don't use pruning.
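For a small corpus, a minimal lmplz call could look like this (file names are placeholders, not from your setup):

lmplz -o 3 --text transcripts.txt --arpa lm.arpa
# add --discount_fallback if lmplz complains about discounts on a tiny corpus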

For build_binary, mine works without the dash before trie:

build_binary -a 255 -q 8 -v trie current/lm.arpa current/lm.binary

So, please try it without the dash.

Yep, it worked without the dash (thank you!!!):
Reading words.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


Quantizing
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


SUCCESS

But generate_package gives me an error :frowning:
4860 unique words read from vocabulary file.
Doesn't look like a character based model.
Package created in kenlm.scorer

Now it doesn't say anything about the missing header. In the vocabulary file I placed all my transcripts; they are sentences in Romanian, and we have some special characters like ă, î, ş, ţ, â. I also placed them in the alphabet…
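For context, the generate_package call has this shape (the paths and the alpha/beta values here are placeholders, not my exact ones):

python generate_package.py \
  --alphabet data/alphabet.txt \
  --lm current/lm.binary \
  --vocab data/vocabulary.txt \
  --package kenlm.scorer \
  --default_alpha 0.75 \
  --default_beta 1.85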

Don't worry about the character model warning. If I remember correctly, that is for languages like Chinese.

4860 words is not much. You might expand this by using Romanian Wikipedia or something like that.
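If you go that route, a rough sketch for pulling plain text out of the Romanian Wikipedia dump (the dump URL pattern and the wikiextractor usage are from memory, so double-check them):

wget https://dumps.wikimedia.org/rowiki/latest/rowiki-latest-pages-articles.xml.bz2
pip install wikiextractor
python -m wikiextractor.WikiExtractor rowiki-latest-pages-articles.xml.bz2 -o rowiki_text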

Well, at the moment I am training it for a single speaker. I have multiple speakers and around 17 hours of recordings provided by my University, and I have a lot of transcripts that I will probably use. Anyway, thank you so much for your help! As I said, I am working on my Final Project, and I will mention you and this helpful community in it. Maybe one day I will be able to help someone else who is training DeepSpeech for Romanian. :smiley: Many, many thanks!

Do you think my .sh file is correctly written?

#!/bin/sh

set -xe
if [ ! -f DeepSpeech.py ]; then
    echo "Please make sure you run this from DeepSpeech's top level directory."
    exit 1
fi

# Give checkpoint_dir a default so the flag below is never empty
# (the path is a placeholder; adjust it to your setup).
checkpoint_dir="${checkpoint_dir:-checkpoints/ro}"

python -u DeepSpeech.py \
  --train_files data/train/train.csv \
  --test_files data/test/test.csv \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 100 \
  --epochs 200 \
  --checkpoint_dir "$checkpoint_dir" \
  "$@"

Because it gives me a cute error:
bash: ./bin/run-andreea.sh: Permission denied

I am not an expert in bash; try running the command directly on the command line first, then move it into a script.
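That said, "Permission denied" for a script usually just means the file is not executable, so this should help:

chmod +x bin/run-andreea.sh    # make the script executable
./bin/run-andreea.sh
# or run it through the shell directly, without the execute bit:
sh bin/run-andreea.sh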

As for DeepSpeech, you should increase the train and dev batch sizes to speed up training. If you have a GPU, use train_cudnn. What to use for n_hidden varies widely, typically powers of 2, so maybe 128 or 256. I didn't see much of a difference between such values, but I use larger inputs.

If you have the space, store more checkpoints so you can check whether an earlier checkpoint gives better results when you run for 100 epochs.
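Putting those suggestions together, the training call could look something like this (the batch sizes and n_hidden are just starting points, and double-check the flag names against your DeepSpeech version):

python -u DeepSpeech.py \
  --train_files data/train/train.csv \
  --dev_files data/dev/dev.csv \
  --test_files data/test/test.csv \
  --train_batch_size 16 \
  --dev_batch_size 16 \
  --test_batch_size 16 \
  --n_hidden 256 \
  --epochs 100 \
  --train_cudnn \
  --max_to_keep 10 \
  --checkpoint_dir "$checkpoint_dir"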
