Generated Scorer for 0.7 has a bad header

I’m migrating from 0.6 to 0.7 following the docs. I have generated a new scorer with

python3 generate_package.py --alphabet ../my/my-alphabets.txt --lm ../my/lm.binary --vocab ../my/my-vocab.txt --package ../my/my_lm.scorer --default_alpha 1.5 --default_beta 1.85

Unfortunately, when trying to start training it fails with:

ValueError: Scorer initialization failed with error code 1

Inspecting the failing scorer file with the head command shows a three-line git-lfs header followed by binary data:

version https://git-lfs.github.com/spec/v1
oid sha256:94dc681c40e7731a82e9fbd7f6..d943a1d0411
size 1581036
EIRT▒?▒▒▒▒?▒~consstandard▒Z▒▒ ...

By contrast, the head of the default kenlm.scorer shows:

mmap lm http://kheafield.com/code format version 5
▒?▒▒▒▒▒▒?#▒4Ɯ, ..

How can this bad header be generated? How can I fix that?

Did you use the “old” lm.binary? If so, generate a new one; there are more details in this thread.


Thanks. After I created a new lm.binary with the -v option, the generated scorer looks normal with the head command.

Unfortunately, training still fails with the same error:

  File "/home/teldeeb/ai/dsq7/training/deepspeech_training/train.py", line 891, in early_training_checks
    FLAGS.scorer_path, Config.alphabet)
  File "/opt/anaconda3/lib/python3.7/site-packages/ds_ctcdecoder/__init__.py", line 36, in __init__
    raise ValueError('Scorer initialization failed with error code {}'.format(err))
ValueError: Scorer initialization failed with error code 1

What @othiele said. You can reuse the old ARPA file: re-run KenLM’s build_binary with the -v flag, then run generate_package.py on the result.
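
For anyone scripting this step, here is a minimal Python sketch of the rebuild. Only the -v flag comes from this thread. The other options (-a 255, -q 8, trie) are what I believe DeepSpeech's own generate_lm.py passes, so treat them as an assumption and check the docs for your release. The helper names and the lm.arpa path are hypothetical.

```python
import subprocess


def build_binary_command(arpa_path, binary_path, build_binary="build_binary"):
    """Return the argument list for KenLM's build_binary, including -v."""
    return [
        build_binary,
        "-a", "255",  # assumption: value taken from DeepSpeech's generate_lm.py
        "-q", "8",    # assumption: value taken from DeepSpeech's generate_lm.py
        "-v",         # the flag this thread says the 0.7 scorer needs
        "trie",
        arpa_path,
        binary_path,
    ]


def rebuild_lm_binary(arpa_path, binary_path, build_binary="build_binary"):
    """Run build_binary and raise CalledProcessError if it fails."""
    subprocess.check_call(build_binary_command(arpa_path, binary_path, build_binary))
```

Calling rebuild_lm_binary("../my/lm.arpa", "../my/lm.binary") would then be followed by the same generate_package.py invocation as in the first post.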


This looks like you ran generate_package.py with an invalid LM (cloned without git-lfs being properly installed).
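
A quick way to catch this before packaging is to look at the first bytes of lm.binary. The sketch below is only a heuristic based on the two headers quoted earlier in this thread; the function name and the returned labels are made up for illustration.

```python
# Header prefixes as they appear in the head output quoted in this thread.
GIT_LFS_MAGIC = b"version https://git-lfs.github.com/spec/v1"
KENLM_MAGIC = b"mmap lm http://kheafield.com/code format version"


def classify_header(path):
    """Guess what kind of file this is from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(64)
    if head.startswith(GIT_LFS_MAGIC):
        # A git-lfs pointer was checked out instead of the real file.
        return "git-lfs pointer"
    if head.startswith(KENLM_MAGIC):
        return "kenlm binary"
    return "unknown"
```

If classify_header("../my/lm.binary") reports a git-lfs pointer, the real file was never fetched, and the LM has to be downloaded or rebuilt before running generate_package.py.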

Thanks, the header is now fixed after generating a new lm.binary and scorer. But training still fails to start with the scorer. It runs normally without the scorer, but that will not produce an accurate model.

How can I get more verbose info about the failing scorer?

The scorer does not affect the training of the acoustic model itself, it is only used at the very last step, for the test epoch.

For now, only by editing the code or running it under a debugger, sadly.

Adding to @reuben: maybe the alphabet files differ, or the text contains unknown characters? Building the language model from scratch works fine for me. Otherwise, create a very simple language model from scratch and check that it works.
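
To check the unknown-characters idea, something like the following Python sketch could compare the vocabulary text against the alphabet file. It assumes the usual DeepSpeech alphabet layout (one character per line, lines starting with # are comments, \# stands for a literal hash). Both function names are hypothetical, and this is not the loader DeepSpeech itself uses.

```python
def load_alphabet(path):
    """Read an alphabet file into a set of characters (assumed format)."""
    chars = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\r\n")  # keep a lone space, it is a valid label
            if line.startswith("\\#"):
                chars.add("#")          # escaped literal hash
            elif line.startswith("#"):
                continue                # comment line
            elif line:
                chars.add(line)
    return chars


def unknown_characters(vocab_path, alphabet):
    """Return the characters in the vocab text that the alphabet lacks."""
    unknown = set()
    with open(vocab_path, encoding="utf-8") as f:
        for line in f:
            unknown.update(ch for ch in line.rstrip("\r\n") if ch not in alphabet)
    return unknown
```

Running unknown_characters("../my/my-vocab.txt", load_alphabet("../my/my-alphabets.txt")) should return an empty set if the two files agree. Anything it returns is worth cleaning from the text or adding to the alphabet before rebuilding the scorer.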

Thanks @othiele and @reuben. Rebuilding all the language files again seems to have resolved the problem. Training can now start.

I wish more verbose messages were printed for common errors like this.