I had to, sorry
It looks like you are not using the proper version.
Or maybe try with the released v0.6.1 (if it is enough for you) and build the lm.binary and trie files. That workflow may work more smoothly for you.
The version of DeepSpeech, or something else? Is it possible to add a dictionary to a newer version of DeepSpeech? This is my version: 0.7.0-alpha.2
The version of KenLM. It seems this part of the project needs some polishing.
Just delete kenlm and download it again from GitHub: https://github.com/kpu/kenlm
It'll work.
Updating kenlm does not help. I had the same issue, and apparently what changed with respect to v0.6.1 is that you need to provide the -v
argument to build_binary
. The error Error: Can't parse scorer file, invalid header. Try updating your scorer file.
is not very helpful here.
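To make the fix concrete, the corrected invocation might look like the sketch below. The paths and the quantization/trie options are placeholders (the values are the ones quoted later in this thread), so adjust them to your setup:

```shell
# Sketch: build_binary needs an explicit "-v <type>" so the resulting binary
# carries the header that newer DeepSpeech scorer code expects.
# All paths below are placeholders; BUILD_BINARY should point at your KenLM build.
BUILD_BINARY=path/to/kenlm/build/bin/build_binary
ARPA=words.arpa
LM=lm.binary

# Compose the command first so it can be inspected before running.
CMD="$BUILD_BINARY -a 255 -q 8 -v trie $ARPA $LM"
echo "$CMD"
# Run it by hand once words.arpa exists:
# $CMD
```

The important part is `-v trie` before the ARPA and output file names; without it, generate_package.py cannot parse the header.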
Now I only get Doesn't look like a character based model
, but the package creation succeeds.
You need the -v argument, that's all there is to it.
first step:
git clone https://github.com/kpu/kenlm.git
second step:
mkdir -p build
cd build
cmake …
make -j 4
I am using DeepSpeech v0.6.1, and I followed your steps, but it actually does not work, and I get the same error: Can't parse scorer file, invalid header. Try updating your scorer file.
Please follow the documentation matching the version you are working on.
Thanks for your reply, I have resolved my problem. Thank you!
Hi! I am using DeepSpeech 0.7.0-alpha.2. I also receive the error when I try to run generate_package.py:
4860 unique words read from vocabulary file.
Doesn't look like a character based model.
Error: Can't parse scorer file, invalid header. Try updating your scorer file.
Package created in kenlm.scorer
I don't understand what that -v argument you are talking about is, or where I should place it. Or whatever else the problem may be.
I performed the following steps so far:
path/lmplz --text vocabulary.txt --arpa words.arpa --o 3
path/build_binary -T -s words.arpa lm.binary
For generate_package.py:
python generate_package.py --alphabet path/alphabet.txt --lm path/lm.binary --vocab path/vocab.txt --default_alpha 0.75 --default_beta 1.85 --package kenlm.scorer
Hi! I am using DeepSpeech 0.7.0-alpha.1 and I have the same problem. I followed the same steps as you and got the same error response. If you find the solution, please share it here.
The -v parameter is passed to build_binary. Now that 0.7.0 is out, you can use the generate_lm.py script, which already handles that: https://github.com/mozilla/DeepSpeech/tree/v0.7.0/data/lm
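For v0.7.0 the whole pipeline can be driven from that one script, which runs lmplz and build_binary (including the -v flag) for you. Below is a sketch of an invocation; the flag names are assumptions based on the v0.7.0 data/lm documentation, so confirm them with `python3 generate_lm.py --help` before running:

```shell
# Sketch: one generate_lm.py call replaces the manual lmplz/build_binary steps.
# Flag names and values are assumptions from the v0.7.0 data/lm README; verify
# them against --help. Paths are placeholders.
GEN_LM_CMD="python3 generate_lm.py \
  --input_txt vocabulary.txt \
  --output_dir . \
  --top_k 500000 \
  --kenlm_bins path/to/kenlm/build/bin \
  --arpa_order 5 \
  --max_arpa_memory 85% \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie"
echo "$GEN_LM_CMD"
# Run it by hand once the paths are filled in.
```

If that succeeds, the resulting lm.binary already has the header that generate_package.py expects.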
How I fixed my problem:
path/to/lmplz --order 5 --temp_prefix tmp --memory 50% --text vocab.txt --arpa words.arpa --prune 0 0 1
Then:
path/to/build_binary -a 255 -q 8 -v trie words.arpa lm.binary
Now make sure you are in the lm directory and run:
python generate_package.py --alphabet path/alphabet.txt --lm path/lm.binary --vocab path/vocab.txt --default_alpha 0.75 --default_beta 1.85 --package kenlm.scorer
Thanks for answering.
In the first command, I removed --temp_prefix tmp
to make it work.
My return:
```
=== 1/5 Counting and sorting n-grams ===
Reading /home/vocabulary.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Unigram tokens 102154 types 12284
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:147408 2:172963040 3:324305728 4:518889152 5:756713408
Statistics:
1 12284 D1=0.644877 D2=1.08174 D3+=1.27482
2 54573 D1=0.804023 D2=1.13055 D3+=1.47694
3 8419/79369 D1=0.903966 D2=1.27706 D3+=1.46927
4 3824/81367 D1=0.956851 D2=1.44009 D3+=1.68743
5 1647/72999 D1=0.964164 D2=1.60299 D3+=1.94048
Memory estimate for binary LM:
type      kB
probing 1906 assuming -p 1.5
probing 2346 assuming -r models -p 1.5
trie    1035 without quantization
trie     654 assuming -q 8 -b 8 quantization
trie     962 assuming -a 22 array pointer compression
trie     581 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:147408 2:873168 3:168380 4:91776 5:46116
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:147408 2:873168 3:168380 4:91776 5:46116
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
Name:lmplz VmPeak:1906392 kB VmRSS:8664 kB RSSMax:333128 kB user:0.154804 sys:0.204341 CPU:0.359177 real:0.458181
```
However, when I tried to use the second command, my return was as follows:
```
Reading words.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
/home/kenlm/util/file.cc:133 in void util::ResizeOrThrow(int, uint64_t) threw FDException because `ret'. Operation not permitted in /home/lm.binaryeZF4t8 (deleted) while resizing to 0 bytes Byte: 75 ERROR
```
I am using a Docker container with Debian 9. Does anyone have any idea what this could be?
What commands did you run exactly, how big are the files and how big is the server?
Hi,
The first command that I ran was this:
path/to/lmplz --order 5 --memory 85% --text vocabulary.txt --arpa words.arpa --prune 0 0 1
apparently, everything went well. The file vocabulary.txt
is 584 KB and corresponds to all the audio transcripts from the Common Voice dataset for Brazilian Portuguese. The container has 2 GB of RAM.
The second command was this:
path/to/build_binary -a 255 -q 8 -v trie words.arpa lm.binary
The error response:
/home/kenlm/util/file.cc:133 in void util::ResizeOrThrow(int, uint64_t) threw FDException because `ret'. Operation not permitted in /home/lm.binaryeZF4t8 (deleted) while resizing to 0 bytes Byte: 75 ERROR
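That ftruncate "Operation not permitted" error typically points at the filesystem the output and temporary files live on: some Docker storage drivers and mounts do not support resizing files in place. One workaround, sketched below under the assumption that you can bind-mount a host volume into the container, is to keep both the temp file and the output on that volume. build_binary accepts a -T option to redirect its temporary file (check its usage output to confirm on your version); all paths here are placeholders:

```shell
# Sketch: write build_binary's output (and temp file) to a filesystem that
# supports ftruncate, e.g. a bind-mounted host volume. A local directory is
# used here so the sketch runs anywhere; replace it with e.g. /data/lm
# inside your container.
WORK=./lm_work
mkdir -p "$WORK"

# Keeping both -T (temp file location) and the output on the mounted volume
# avoids the overlay filesystem entirely.
BB_CMD="path/to/build_binary -a 255 -q 8 -v -T $WORK/tmp trie words.arpa $WORK/lm.binary"
echo "$BB_CMD"
```

Alternatively, running the whole build from a directory that is itself a bind mount (e.g. `docker run -v "$PWD:/data" -w /data …`) should have the same effect.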