Generating trie

The lmplz command looks good. I guess you could go with an order lower than 5 if you only have a couple of MB of text; in that case use 2 or 3 and don't use pruning.
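For a small corpus, a minimal lmplz call could look like this (file names are placeholders, not from your setup):

lmplz -o 3 --text transcripts.txt --arpa lm.arpa
# add --discount_fallback if lmplz complains about discounts on a tiny corpus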

For build_binary, mine works without the dash before trie:

build_binary -a 255 -q 8 -v trie current/lm.arpa current/lm.binary

So, please try it without the dash.

Yep, it worked without the dash (thank you!!!):
Reading words.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


Quantizing
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


SUCCESS

But generate_package gives me an error :frowning:
4860 unique words read from vocabulary file.
Doesn't look like a character based model.
Package created in kenlm.scorer

Now it doesn't say anything about the missing header. In the vocabulary file I placed all my transcripts; they are sentences in Romanian, and we have some special characters like ă, î, ş, ţ, â. I also placed them in the alphabet…
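For context, the generate_package call has this shape (the paths and the alpha/beta values here are placeholders, not my exact ones):

python generate_package.py \
  --alphabet data/alphabet.txt \
  --lm current/lm.binary \
  --vocab data/vocabulary.txt \
  --package kenlm.scorer \
  --default_alpha 0.75 \
  --default_beta 1.85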

Don't worry about the character model warning. If I remember correctly, that is for languages like Chinese.

4860 words is not much. You might expand this by using Romanian Wikipedia or something like that.
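If you go that route, a rough sketch for pulling plain text out of the Romanian Wikipedia dump (the dump URL pattern and the wikiextractor usage are from memory, so double-check them):

wget https://dumps.wikimedia.org/rowiki/latest/rowiki-latest-pages-articles.xml.bz2
pip install wikiextractor
python -m wikiextractor.WikiExtractor rowiki-latest-pages-articles.xml.bz2 -o rowiki_text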

Well, at the moment I am training it for a single speaker. I have multiple speakers and around 17 hours of recordings provided by my University, and I have a lot of transcripts that I will probably use. Anyway, thank you so much for your help! As I said, I am working on my Final Project, and I will mention you and this helpful community in it. Maybe one day I will be able to help someone else who is training DeepSpeech for Romanian. :smiley: Many, many thanks!

Do you think my .sh file is correctly written?

#!/bin/sh

set -xe
if [ ! -f DeepSpeech.py ]; then
    echo "Please make sure you run this from DeepSpeech's top level directory."
    exit 1
fi

# Give checkpoint_dir a default so the flag below is never empty
# (the path is a placeholder; adjust it to your setup).
checkpoint_dir="${checkpoint_dir:-checkpoints/ro}"

python -u DeepSpeech.py \
  --train_files data/train/train.csv \
  --test_files data/test/test.csv \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 100 \
  --epochs 200 \
  --checkpoint_dir "$checkpoint_dir" \
  "$@"

Because it gives me a cute error:
bash: ./bin/run-andreea.sh: Permission denied

I am not an expert in bash; try running the command directly on the command line first, then move it into a script.
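That said, "Permission denied" for a script usually just means the file is not executable, so this should help:

chmod +x bin/run-andreea.sh    # make the script executable
./bin/run-andreea.sh
# or run it through the shell directly, without the execute bit:
sh bin/run-andreea.sh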

As for DeepSpeech, you should increase the train and dev batch sizes to speed up training. If you have a GPU, use train_cudnn. What to use for n_hidden varies widely, typically powers of 2, so maybe 128 or 256. I didn't see much of a difference between such values, but I use larger inputs.

If you have the space, store more checkpoints so you can check whether an earlier checkpoint gives better results when you run for 100 epochs.
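Putting those suggestions together, the training call could look something like this (the batch sizes and n_hidden are just starting points, and double-check the flag names against your DeepSpeech version):

python -u DeepSpeech.py \
  --train_files data/train/train.csv \
  --dev_files data/dev/dev.csv \
  --test_files data/test/test.csv \
  --train_batch_size 16 \
  --dev_batch_size 16 \
  --test_batch_size 16 \
  --n_hidden 256 \
  --epochs 100 \
  --train_cudnn \
  --max_to_keep 10 \
  --checkpoint_dir "$checkpoint_dir"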
