Generating your own scorer file

Looks like you are not using the proper version.

@Akmal_Nodirov Try rebuilding build_binary and the other tools from KenLM master?

Or maybe try with the released v0.6.1 (if that is enough for you) and build the lm.binary and trie files. Maybe that workflow will work more smoothly for you.
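If you go the v0.6.1 route, the old workflow was roughly the following (a sketch from memory; check the v0.6.1 data/lm documentation, and note that generate_trie ships with the native_client package; file names like vocabulary.txt and alphabet.txt are placeholders):

```
# Build the ARPA model and the binary LM with KenLM
path/to/lmplz --order 5 --text vocabulary.txt --arpa words.arpa
path/to/build_binary -a 255 -q 8 -v trie words.arpa lm.binary
# Build the trie with the DeepSpeech native_client tool
./generate_trie alphabet.txt lm.binary trie
```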

Version of DeepSpeech? Or something else? Is it possible to add a dictionary to newer versions of DeepSpeech? This is my version: 0.7.0-alpha.2

Version of KenLM. It seems we need polishing on this part of the project :confused:

Just delete kenlm and download kenlm from GitHub: https://github.com/kpu/kenlm
It’ll work.

Updating kenlm does not help. I had the same issue, and apparently what changed with respect to v0.6.1 is that you need to provide the -v argument to build_binary. The error Error: Can’t parse scorer file, invalid header. Try updating your scorer file. is not very helpful here.
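For reference, a minimal invocation with that flag could look like this (paths and file names are just placeholders):

```
# -v tells build_binary not to embed the vocabulary strings in the binary;
# the 0.7.x scorer package carries the vocabulary separately.
path/to/build_binary -v trie words.arpa lm.binary
```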

Now I only get Doesn’t look like a character based model, but the package creation succeeds :smiley:

You need the -v argument, that's all it's about.
First step:
git clone https://github.com/kpu/kenlm.git

Second step:
cd kenlm
mkdir -p build
cd build
cmake ..
make -j 4
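After a successful build, the tools end up in build/bin; a quick sanity check (paths assumed):

```
# lmplz and build_binary should both be present after the build
ls build/bin/lmplz build/bin/build_binary
```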

I am using DeepSpeech v0.6.1, and I followed your steps, but it actually does not work, and I get the same error: Can’t parse scorer file, invalid header. Try updating your scorer file.

Please follow the documentation matching the version you are working on.

Thanks for your reply, I have resolved my problem. Thank you!

Hi! I am using DeepSpeech 0.7.0 alpha2. I also get the error when I run generate_package.py:
4860 unique words read from vocabulary file.
Doesn’t look like a character based model.
Error: Can’t parse scorer file, invalid header. Try updating your scorer file.
Package created in kenlm.scorer
I don’t understand what that -v argument you are talking about is, or where I should place it. Or whatever the problem is :frowning:
I performed the following steps so far:
path/lmplz --text vocabulary.txt --arpa words.arpa --o 3
path/build_binary -T -s words.arpa lm.binary

For generate_package.py:
python generate_package.py --alphabet path/alphabet.txt --lm path/lm.binary --vocab path/vocab.txt --default_alpha 0.75 --default_beta 1.85 --package kenlm.scorer

Hi! I am using DeepSpeech 0.7.0 alpha1 and I have the same problem. I followed the same steps as you and got the same error. If you find the solution, please share it here.

The -v parameter is passed to build_binary. Now that 0.7.0 is out, you can use the generate_lm.py script, which already handles that: https://github.com/mozilla/DeepSpeech/tree/v0.7.0/data/lm
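Something like the following, based on the flags documented in the linked 0.7.0 README (illustrative values and placeholder paths; double-check them against your checkout):

```
python3 generate_lm.py \
  --input_txt vocabulary.txt \
  --output_dir . \
  --top_k 500000 \
  --kenlm_bins path/to/kenlm/build/bin/ \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie
```

It should run lmplz and build_binary for you (including the -v flag) and write lm.binary plus a filtered vocabulary file.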


How I fixed my problem:

path/to/lmplz --order 5 --temp_prefix tmp --memory 50% --text vocab.txt --arpa words.arpa --prune 0 0 1
Then:
path/to/build_binary -a 255 -q 8 -v trie words.arpa lm.binary

Now make sure you are in the lm directory and run:

python generate_package.py --alphabet path/alphabet.txt --lm path/lm.binary --vocab path/vocab.txt --default_alpha 0.75 --default_beta 1.85 --package kenlm.scorer
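Once kenlm.scorer exists, it can be passed to the 0.7.x client, e.g. (model and audio paths are placeholders):

```
deepspeech --model path/to/output_graph.pbmm \
           --scorer kenlm.scorer \
           --audio path/to/audio.wav
```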


Thanks for answering.


Thanks for answering.
In the first command, I removed --temp_prefix tmp to make it work.
My output:
```
=== 1/5 Counting and sorting n-grams ===
Reading /home/vocabulary.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


Unigram tokens 102154 types 12284
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:147408 2:172963040 3:324305728 4:518889152 5:756713408
Statistics:
1 12284 D1=0.644877 D2=1.08174 D3+=1.27482
2 54573 D1=0.804023 D2=1.13055 D3+=1.47694
3 8419/79369 D1=0.903966 D2=1.27706 D3+=1.46927
4 3824/81367 D1=0.956851 D2=1.44009 D3+=1.68743
5 1647/72999 D1=0.964164 D2=1.60299 D3+=1.94048
Memory estimate for binary LM:
type kB
probing 1906 assuming -p 1.5
probing 2346 assuming -r models -p 1.5
trie 1035 without quantization
trie 654 assuming -q 8 -b 8 quantization
trie 962 assuming -a 22 array pointer compression
trie 581 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:147408 2:873168 3:168380 4:91776 5:46116
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:147408 2:873168 3:168380 4:91776 5:46116
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


Name:lmplz VmPeak:1906392 kB VmRSS:8664 kB RSSMax:333128 kB user:0.154804 sys:0.204341 CPU:0.359177 real:0.458181
```

However, when I tried to run the second command, the output was as follows:

```
Reading words.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100

/home/kenlm/util/file.cc:133 in void util::ResizeOrThrow(int, uint64_t) threw FDException because `ret'. Operation not permitted in /home/lm.binaryeZF4t8 (deleted) while resizing to 0 bytes Byte: 75 ERROR
```

I am using a Docker container with Debian 9. Does anyone have any idea what this could be?

What commands did you run exactly, how big are the files and how big is the server?


Hi,
The first command that I ran was this:
path/to/lmplz --order 5 --memory 85% --text vocabulary.txt --arpa words.arpa --prune 0 0 1
Apparently, everything went well. The file vocabulary.txt is 584 KB and corresponds to all the audio transcripts from the Common Voice dataset for Brazilian Portuguese. The container has 2 GB of RAM.

The second command was this:
path/to/build_binary -a 255 -q 8 -v trie words.arpa lm.binary

The error response:

/home/kenlm/util/file.cc:133 in void util::ResizeOrThrow(int, uint64_t) threw FDException because `ret'. Operation not permitted in /home/lm.binaryeZF4t8 (deleted) while resizing to 0 bytes Byte: 75 ERROR

Try with order 2 and without pruning, and leave out the -a and -q options. With 600 KB of text you have almost no material at all; those parameters are meant for gigabytes of data.
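In other words, something like this (paths are placeholders, untested on your data):

```
# Tiny corpus: low order, no pruning, no -a/-q quantization flags
path/to/lmplz --order 2 --text vocabulary.txt --arpa words.arpa
path/to/build_binary -v trie words.arpa lm.binary
```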