Generating own scorer file

Hi. I'm trying to create a scorer file, but it failed with this message:

9 unique words read from vocabulary file.
Doesn’t look like a character based model.
Error: Can’t parse scorer file, invalid header. Try updating your scorer file.


I couldn't find any related links to get help. Please help me. Thank you.

I could not find any steps to repro in your message. Hard to check what you did.

This is my first step:

 %cd /content/kenlm/build/bin
 !./lmplz --order 5 --memory 50% --temp_prefix 15 --text /content/deepspeech/uzbek/dictionary.txt --arpa  /content/deepspeech/uzbek/dictionary.arpa  --discount_fallback --prune 0 0 1
 !./build_binary -a 255 -s -q 8 trie   /content/deepspeech/uzbek/dictionary.arpa  /content/deepspeech/uzbek/lm.binary

This is the second one. The error appears after it:

%cd /content/deepspeech/       
!python ./data/lm/generate_package.py --alphabet /content/deepspeech/uzbek/alphabet.txt --lm /content/deepspeech/uzbek/lm.binary --vocab /content/deepspeech/uzbek/dictionary.txt  --default_alpha 0.75 --default_beta 1.85 --package /content/deepspeech/uzbek/uzbek.scorer

What’s the content of that file ?

@Akmal_Nodirov Also, what exact commit are you on ? What’s pip list | grep ds_ctcdecoder ?

The content of the file is:
asslomu aleykum do’stim bu men ismim Akmal Ozodbek Shahzod

What do you mean by exact commit? I cloned the latest version of DeepSpeech. It's 0.7 or later, the most recent one.

What’s your HEAD at.

I don’t see the -v we document in data/lm/generate_lm.py for that call and that generate_scorer.py --help advises you to ensure.

I think it's this one, because I cloned it approximately 9 hours ago.

I have added -v:
!./build_binary -a 255 -s -q 8 -v trie /content/deepspeech/uzbek/dictionary.arpa /content/deepspeech/uzbek/lm.binary

but this error appears:
./build_binary: invalid option -- 'v'
Usage: ./build_binary [-u log10_unknown_probability] [-s] [-i] [-w mmap|after] [-p probing_multiplier]

Or could you give me the correct procedure for generating the file, with all the steps if possible? Thank you.

I had to, sorry :smiley:


Looks like you are not using the proper version.

@Akmal_Nodirov Try and rebuild build_binary and others from KenLM master?
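
One quick way to tell whether a given build_binary is recent enough is to look for [-v] in its usage text. The sketch below is only an illustration: usage_mentions_v is a hypothetical helper name, and it assumes that newer KenLM builds list [-v] in the usage line. The sample text is the usage line from the error output quoted earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if the given usage text lists a [-v] option.
# Assumption: KenLM builds that support -v show "[-v]" in their usage line.
usage_mentions_v() {
    printf '%s\n' "$1" | grep -q -F -- '[-v]'
}

# Usage line copied from the error output posted above (an older build).
old_usage='Usage: ./build_binary [-u log10_unknown_probability] [-s] [-i] [-w mmap|after] [-p probing_multiplier]'

if usage_mentions_v "$old_usage"; then
    echo "build_binary accepts -v"
else
    echo "build_binary is too old for -v; rebuild KenLM from master"
fi
```

Against a real binary, something like usage_mentions_v "$(./build_binary 2>&1)" should work, assuming build_binary prints its usage when called without arguments.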

Or maybe try with released v0.6.1 (if it is enough) and build lm.binary and trie files. Maybe this workflow will work smoother for you.

The version of DeepSpeech, or of something else? Is it possible to add a dictionary to newer versions of DeepSpeech? This is my version: 0.7.0-alpha.2

Version of KenLM. It seems we need polishing on this part of the project :confused:

Just delete kenlm and download kenlm from GitHub: https://github.com/kpu/kenlm
It'll work.

Updating KenLM alone does not help. I had the same issue, and apparently what changed with respect to v0.6.1 is that you need to pass the -v argument to build_binary. The error message "Error: Can’t parse scorer file, invalid header. Try updating your scorer file." is not very helpful here.

Now I get only Doesn't look like a character based model, but the package creation succeeds :smiley:
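
For anyone landing here later, the steps from this thread can be put together roughly as follows. This is a sketch, not an official recipe: the paths and flags are the ones from the original posts, with only -v added to build_binary, and the run wrapper and DRY_RUN variable are hypothetical conveniences so the commands can be reviewed before they are executed.

```shell
#!/bin/sh
# Sketch of the pipeline from this thread, with -v added to build_binary.
# Paths are the ones used in the posts above; adjust them to your setup.
KENLM_BIN=/content/kenlm/build/bin
DS_DIR=/content/deepspeech
DATA="$DS_DIR/uzbek"

# Hypothetical dry-run switch: with DRY_RUN=1 (the default here) the
# commands are only printed; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

scorer_pipeline() {
    # 1. Build the ARPA language model from the vocabulary text.
    run "$KENLM_BIN/lmplz" --order 5 --memory 50% --temp_prefix 15 \
        --text "$DATA/dictionary.txt" --arpa "$DATA/dictionary.arpa" \
        --discount_fallback --prune 0 0 1 || return 1
    # 2. Convert it to a binary trie; -v needs a recent KenLM build.
    run "$KENLM_BIN/build_binary" -a 255 -s -q 8 -v trie \
        "$DATA/dictionary.arpa" "$DATA/lm.binary" || return 1
    # 3. Package the scorer.
    run python "$DS_DIR/data/lm/generate_package.py" \
        --alphabet "$DATA/alphabet.txt" --lm "$DATA/lm.binary" \
        --vocab "$DATA/dictionary.txt" \
        --default_alpha 0.75 --default_beta 1.85 \
        --package "$DATA/uzbek.scorer" || return 1
}

scorer_pipeline
```

If build_binary rejects -v with "invalid option", the KenLM build is too old and needs to be rebuilt from a current checkout first.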