Generating own scorer file

Akmal_Nodirov · February 27, 2020, 12:16pm

Hi. I am trying to create a scorer file, but it failed with this message:

9 unique words read from vocabulary file.
Doesn’t look like a character based model.
Error: Can’t parse scorer file, invalid header. Try updating your scorer file.

I couldn’t find any related links to get help. Please help me. Thank you.

lissyx · February 27, 2020, 12:43pm

I could not find any steps to repro in your message. Hard to check what you did.

Akmal_Nodirov · February 27, 2020, 12:49pm

This is my first step

 %cd /content/kenlm/build/bin
 !./lmplz --order 5 --memory 50% --temp_prefix 15 --text /content/deepspeech/uzbek/dictionary.txt --arpa  /content/deepspeech/uzbek/dictionary.arpa  --discount_fallback --prune 0 0 1
 !./build_binary -a 255 -s -q 8 trie   /content/deepspeech/uzbek/dictionary.arpa  /content/deepspeech/uzbek/lm.binary

This is the second one. The error appears after it:

%cd /content/deepspeech/       
!python ./data/lm/generate_package.py --alphabet /content/deepspeech/uzbek/alphabet.txt --lm /content/deepspeech/uzbek/lm.binary --vocab /content/deepspeech/uzbek/dictionary.txt  --default_alpha 0.75 --default_beta 1.85 --package /content/deepspeech/uzbek/uzbek.scorer

lissyx · February 27, 2020, 12:56pm

What’s the content of that file ?

lissyx · February 27, 2020, 12:58pm

@Akmal_Nodirov Also, what exact commit are you on? What does pip list | grep ds_ctcdecoder show?

Akmal_Nodirov · February 27, 2020, 12:59pm

This is the content of the file:
asslomu aleykum do’stim bu men ismim Akmal Ozodbek Shahzod
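
The "9 unique words read from vocabulary file" line in the error output is consistent with this file, which holds nine distinct whitespace-separated words. A minimal sketch for checking that count locally, assuming the scorer tooling splits the vocabulary on whitespace (an assumption, not confirmed in this thread); count_unique_words is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: count distinct whitespace-separated words in a file.
# Assumption: the scorer tooling also splits the vocabulary on whitespace.
count_unique_words() {
    tr -s '[:space:]' '\n' < "$1" | grep -v '^$' | sort -u | wc -l | tr -d ' '
}

# Example with the file content quoted above, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' "asslomu aleykum do’stim bu men ismim Akmal Ozodbek Shahzod" > "$tmp"
count_unique_words "$tmp"   # prints 9
rm -f "$tmp"
```

If the number printed here differs from what the tooling reports, the vocabulary file passed with --vocab is probably not the file you think it is.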

Akmal_Nodirov · February 27, 2020, 1:02pm

What do you mean by exact commit? I have cloned the latest version of DeepSpeech. It’s 0.7 or higher, the latest one.

lissyx · February 27, 2020, 1:03pm

What’s your HEAD at?

lissyx · February 27, 2020, 1:05pm

I don’t see the -v that we document in data/lm/generate_lm.py for that call, and that generate_scorer.py --help advises you to make sure is there.

Akmal_Nodirov · February 27, 2020, 1:10pm

I think it’s this, because I cloned it approximately 9 hours ago.

Akmal_Nodirov · February 27, 2020, 1:48pm

I have added -v:
!./build_binary -a 255 -s -q 8 -v trie /content/deepspeech/uzbek/dictionary.arpa /content/deepspeech/uzbek/lm.binary

but this error appears:
./build_binary: invalid option -- 'v'
Usage: ./build_binary [-u log10_unknown_probability] [-s] [-i] [-w mmap|after] [-p probing_multiplier]
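
The usage line quoted here does not list a -v option, which suggests this build_binary predates that flag. A minimal sketch for checking a usage text for the flag, assuming builds that support it list it as [-v] in their usage line (an assumption about KenLM's output format); usage_lists_v is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: succeed if a usage text read from stdin lists [-v].
# Assumption: a build_binary that supports the flag shows it as "[-v]".
usage_lists_v() {
    grep -q -F -e '[-v]'
}

# The usage line as quoted in this thread.
old_usage='Usage: ./build_binary [-u log10_unknown_probability] [-s] [-i] [-w mmap|after] [-p probing_multiplier]'

if printf '%s\n' "$old_usage" | usage_lists_v; then
    echo "build_binary lists -v"
else
    echo "build_binary does not list -v; it may be too old"
fi
```

Against a real binary, the same helper could be fed with something like ./build_binary 2>&1 | usage_lists_v, on the assumption that running it without arguments prints the usage text.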

Akmal_Nodirov · February 27, 2020, 1:53pm

Or could you provide the correct procedure for generating the file, with all the steps if possible? Thank you.

Jendker · February 27, 2020, 2:00pm

I had to, sorry

lissyx · February 27, 2020, 2:41pm

Looks like you are not using the proper version.

lissyx · February 27, 2020, 2:42pm

@Akmal_Nodirov Try and rebuild build_binary and others from KenLM master?

Jendker · February 27, 2020, 3:30pm

Or maybe try the released v0.6.1 (if it is enough for you) and build the lm.binary and trie files. Maybe that workflow will go more smoothly for you.

Akmal_Nodirov · February 28, 2020, 4:15am

Version of DeepSpeech, or something else? Is it possible to add a dictionary to newer versions of DeepSpeech? This is my version: 0.7.0-alpha.2

lissyx · February 28, 2020, 7:59am

Version of KenLM. It seems we need polishing on this part of the project

omarov-abai999 · March 10, 2020, 5:35am

Just delete kenlm and download kenlm from GitHub: https://github.com/kpu/kenlm
It’ll work.

Jendker · March 12, 2020, 2:35pm

Updating KenLM does not help. I had the same issue, and apparently what changed with respect to v0.6.1 is that you need to pass the -v argument to build_binary. The error “Error: Can’t parse scorer file, invalid header. Try updating your scorer file.” is not very helpful here.

Now I only get “Doesn’t look like a character based model.”, but the package creation succeeds.
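
Putting the thread together, the build_binary call from the first step needs -v, and the KenLM build has to be recent enough to accept it. A dry-run sketch that only prints the adjusted command rather than running it; the paths are the ones used earlier in this thread, and build_binary_cmd is a hypothetical helper name:

```shell
#!/bin/sh
# Dry-run sketch: print the build_binary command with -v included.
# Paths are the ones used earlier in this thread; adjust to your setup.
ARPA=/content/deepspeech/uzbek/dictionary.arpa
LM_BINARY=/content/deepspeech/uzbek/lm.binary

# Hypothetical helper: emit the command instead of running it.
# Assumption: the local build_binary is recent enough to accept -v.
build_binary_cmd() {
    printf './build_binary -a 255 -s -q 8 -v trie %s %s\n' "$1" "$2"
}

build_binary_cmd "$ARPA" "$LM_BINARY"
```

The printed line can be pasted into the notebook cell in place of the original build_binary call; the lmplz and generate_package.py steps stay as posted.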