Generating your own scorer file

You need a -v argument, that's what it's all about.
First step:
git clone https://github.com/kpu/kenlm.git

Second step:
cd kenlm
mkdir -p build
cd build
cmake ..
make -j 4
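To show where that -v argument goes, here is a minimal sketch once KenLM is built (file names and paths are placeholders, adjust them to your setup):

```bash
# Build a small ARPA model from a plain-text file (placeholder names).
build/bin/lmplz --order 3 --text vocabulary.txt --arpa words.arpa

# Pass -v to build_binary; "trie" selects the trie data structure.
# This -v is the argument mentioned above.
build/bin/build_binary -v trie words.arpa lm.binary
```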

I am using DeepSpeech v0.6.1, and I followed your steps, but it actually does not work, and I get the same error: Can't parse scorer file, invalid header. Try updating your scorer file.

Please follow the documentation matching the version you are working on.

Thanks for your reply, I have resolved my problem. Thank you!

Hi! I am using DeepSpeech 0.7.0 alpha2. I also receive the error when I try to run generate_package.py:
4860 unique words read from vocabulary file.
Doesn’t look like a character based model.
Error: Can’t parse scorer file, invalid header. Try updating your scorer file.
Package created in kenlm.scorer
I don't understand what that -v argument you are talking about is, or where I should place it. Or whatever the problem is :frowning:
I performed the following steps so far:
path/lmplz --text vocabulary.txt --arpa words.arpa --o 3
path/build_binary -T -s words.arpa lm.binary

For generate_package.py:
python generate_package.py --alphabet path/alphabet.txt --lm path/lm.binary --vocab path/vocab.txt --default_alpha 0.75 --default_beta 1.85 --package kenlm.scorer

Hi! I am using DeepSpeech 0.7.0 alpha1 and I have the same problem. I followed the same steps as you and got the same error response. If you found the solution, please share it here.

The -v parameter is passed to build_binary. Now that 0.7.0 is out, you can use the generate_lm.py script, which already handles that: https://github.com/mozilla/DeepSpeech/tree/v0.7.0/data/lm
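For reference, a sketch of a generate_lm.py call along those lines; the flags mirror the ones used in the v0.8.2 post further down this thread, and the paths and values are placeholders, so check the linked docs for your exact version:

```bash
# Sketch only: adjust paths, top_k, order and pruning to your corpus.
python3 data/lm/generate_lm.py \
  --input_txt vocabulary.txt \
  --output_dir . \
  --top_k 500000 \
  --kenlm_bins path/to/kenlm/build/bin \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie
```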


How I fixed my problem:

path/to/lmplz --order 5 --temp_prefix tmp --memory 50% --text vocab.txt --arpa words.arpa --prune 0 0 1
Then:
path/to/build_binary -a 255 -q 8 -v trie words.arpa lm.binary

Now make sure you are in the lm directory and run:

python generate_package.py --alphabet path/alphabet.txt --lm path/lm.binary --vocab path/vocab.txt --default_alpha 0.75 --default_beta 1.85 --package kenlm.scorer


Thanks for answering.


Thanks for answering.
In the first command, I removed --temp_prefix tmp to make it work.
My output:
```
=== 1/5 Counting and sorting n-grams ===
Reading /home/vocabulary.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


Unigram tokens 102154 types 12284
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:147408 2:172963040 3:324305728 4:518889152 5:756713408
Statistics:
1 12284 D1=0.644877 D2=1.08174 D3+=1.27482
2 54573 D1=0.804023 D2=1.13055 D3+=1.47694
3 8419/79369 D1=0.903966 D2=1.27706 D3+=1.46927
4 3824/81367 D1=0.956851 D2=1.44009 D3+=1.68743
5 1647/72999 D1=0.964164 D2=1.60299 D3+=1.94048
Memory estimate for binary LM:
type kB
probing 1906 assuming -p 1.5
probing 2346 assuming -r models -p 1.5
trie 1035 without quantization
trie 654 assuming -q 8 -b 8 quantization
trie 962 assuming -a 22 array pointer compression
trie 581 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:147408 2:873168 3:168380 4:91776 5:46116
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:147408 2:873168 3:168380 4:91776 5:46116
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


Name:lmplz VmPeak:1906392 kB VmRSS:8664 kB RSSMax:333128 kB user:0.154804 sys:0.204341 CPU:0.359177 real:0.458181
```

However, when I tried the second command, my output was as follows:

```
Reading words.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100

/home/kenlm/util/file.cc:133 in void util::ResizeOrThrow(int, uint64_t) threw FDException because `ret'. Operation not permitted in /home/lm.binaryeZF4t8 (deleted) while resizing to 0 bytes Byte: 75 ERROR
```

I am using a Docker container with Debian 9. Does anyone have any idea what this could be?

What commands did you run exactly, how big are the files and how big is the server?

1 Like

Hi,
The first command that I ran was this:
path/to/lmplz --order 5 --memory 85% --text vocabulary.txt --arpa words.arpa --prune 0 0 1
Apparently, everything went well. The vocabulary.txt file is 584 KB and corresponds to all the audio transcripts from the Common Voice dataset for Brazilian Portuguese. The container has 2 GB of RAM.

The second command was this:
path/to/build_binary -a 255 -q 8 -v trie words.arpa lm.binary

The error response:

/home/kenlm/util/file.cc:133 in void util::ResizeOrThrow(int, uint64_t) threw FDException because `ret'. Operation not permitted in /home/lm.binaryeZF4t8 (deleted) while resizing to 0 bytes Byte: 75 ERROR

Try order 2, without pruning, and leave out the -a and -q options. With 600 KB of text you have almost no material at all, and those parameters are meant for gigabytes of data.
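Something like this, in other words (a sketch with placeholder paths; -v trie is kept, only -a and -q are dropped):

```bash
# Simpler build for a small corpus (placeholder paths).
path/to/lmplz --order 2 --text vocabulary.txt --arpa words.arpa
path/to/build_binary -v trie words.arpa lm.binary
```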

I tried again using the following commands:
kenlm/build/bin/lmplz --order 2 --text vocabulary.txt --arpa words.arpa

after generating the file words.arpa, I ran:

kenlm/build/bin/build_binary words.arpa lm.binary

I needed to remove the flag -v trie because the error persisted. Thus, I managed to generate the file lm.binary successfully. However, when I tried to generate the kenlm.scorer file using the command:

python3 generate_package.py --alphabet ../alphabet.txt --lm lm.binary --vocab vocabulary.txt --package kenlm.scorer --default_alpha 0.75 --default_beta 1.18

I got the following error:

12281 unique words read from vocabulary file.
Doesn’t look like a character based model.
Using detected UTF-8 mode: False
Error: Can’t parse scorer file, invalid header. Try updating your scorer file.
Package created in kenlm.scorer

I've tried several ways to generate the lm.binary file, but even when I manage to, for some reason I can't generate the kenlm.scorer file.

Thanks for your help

@lissyx have you had this error before?

No, no idea what that is, but it seems orthogonal to DeepSpeech.

This error message sounds strange; try a different directory or other txt inputs. Maybe a disk quota issue or special characters in the file? KenLM has always worked for me and I have built 100+ language models without this error.
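For example, a few generic checks along those lines (a sketch; adjust the file names):

```bash
# Free space and any quota on the working directory.
df -h .
quota -s 2>/dev/null || true

# Permissions on the directory and the input file.
ls -ld . vocabulary.txt

# Stray control characters in the input (accented letters are fine).
grep -n -P '[\x00-\x08\x0B\x0C\x0E-\x1F]' vocabulary.txt | head
```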

Leaving out the -v flag is what caused the generate_package error.

@othiele @lissyx @reuben @Andreea_Georgiana_Sarca Thanks for answering.
I solved the problem. The main issue was the folder in which KenLM was located, which did not allow the creation of temporary files, so I couldn't create the lm.binary file correctly.
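In case someone hits the same thing, one workaround (a sketch, assuming build_binary's -T option sets the prefix for its temporary files, which is what the /home/lm.binaryeZF4t8 name in the error above suggests) is to run from a writable directory or send the temporary files elsewhere:

```bash
# Option 1: work from a directory you can write to (placeholder path).
mkdir -p ~/lm-work && cd ~/lm-work
path/to/build_binary -a 255 -q 8 -v trie words.arpa lm.binary

# Option 2: keep the current directory but put temporary files under /tmp/
# (assuming -T takes a file-name prefix, hence the trailing slash).
path/to/build_binary -T /tmp/ -a 255 -q 8 -v trie words.arpa lm.binary
```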

249 unique words read from vocabulary file.
Doesn’t look like a character based (Bytes Are All You Need) model.
--force_utf8 was not specified, using value infered from vocabulary contents: false
Package created in domino_only.scorer

I am facing this issue with v0.8.2, where instead of the generate_package.py script there is a different way of doing things, i.e. using the generate_scorer_package binary.

The script I am using is this:

python generate_lm.py \
--input_txt /data/dominos_full_lm.txt \
--output_dir . \
--top_k 5000 \
--discount_fallback \
--kenlm_bins /DeepSpeech/native_client/kenlm/build/bin \
--arpa_order 5 \
--max_arpa_memory "85%" \
--arpa_prune "0|0|1" \
--binary_a_bits 255 \
--binary_q_bits 8 \
--binary_type trie

/DeepSpeech/tensorflow/bazel-bin/native_client/generate_scorer_package \
--alphabet /data/alphabet.txt \
--lm lm.binary \
--vocab vocab-5000.txt \
--package domino_only.scorer \
--default_alpha 1.2248 \
--default_beta 2.04874