Generated Scorer for 0.7 has a bad header

tarekeldeeb · May 16, 2020, 9:46am

I’m migrating from 0.6 to 0.7 following the docs. I have generated a new scorer with

python3 generate_package.py --alphabet ../my/my-alphabets.txt --lm ../my/lm.binary --vocab ../my/my-vocab.txt --package ../my/my_lm.scorer --default_alpha 1.5 --default_beta 1.85

Unfortunately, when trying to start training it fails with:

ValueError: Scorer initialization failed with error code 1

Inspecting the failing scorer file with the head command shows a three-line git-lfs header followed by binary chunks:

version https://git-lfs.github.com/spec/v1
oid sha256:94dc681c40e7731a82e9fbd7f6..d943a1d0411
size 1581036
EIRT▒?▒▒▒▒?▒~consstandard▒Z▒▒ ...

By contrast, the head of the default kenlm.scorer shows:

mmap lm http://kheafield.com/code format version 5
▒?▒▒▒▒▒▒?#▒4Ɯ, ..

How could this bad header have been generated, and how can I fix it?

othiele · May 16, 2020, 10:29am

Did you use the “old” lm.binary? If so, generate a new one; more details are in this thread.

tarekeldeeb · May 16, 2020, 10:33am

Thanks. When I created a new lm.binary with the -v option, the generated scorer looked normal with the head command.

Unfortunately, the training fails with the same error again:

File "/home/teldeeb/ai/dsq7/training/deepspeech_training/train.py", line 891, in early_training_checks
FLAGS.scorer_path, Config.alphabet)
File "/opt/anaconda3/lib/python3.7/site-packages/ds_ctcdecoder/__init__.py", line 36, in __init__
raise ValueError('Scorer initialization failed with error code {}'.format(err))
ValueError: Scorer initialization failed with error code 1

reuben · May 16, 2020, 10:33am

What @othiele said. You can reuse the old ARPA file: re-run KenLM’s build_binary, this time passing the -v flag, and then run generate_package.py.
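
The rebuild step could be scripted roughly as follows. This is only a sketch: the helper names `build_binary_command` and `rebuild_lm` and the file names are hypothetical, it assumes KenLM's `build_binary` is on the PATH, and it passes only the -v flag mentioned in this thread. Any other options used for the original binary (model type, quantization and so on) are not shown here and would have to be added.

```python
import shutil
import subprocess


def build_binary_command(arpa_path, binary_path, executable="build_binary"):
    """Return the argument list for KenLM's build_binary with the -v flag.

    Only -v comes from this thread; further options are deliberately omitted.
    """
    return [executable, "-v", arpa_path, binary_path]


def rebuild_lm(arpa_path, binary_path, executable="build_binary"):
    """Run build_binary, failing early if the executable cannot be found."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(
            "{} not found on PATH; build KenLM first".format(executable)
        )
    # check=True raises CalledProcessError if build_binary exits non-zero.
    subprocess.run(
        build_binary_command(arpa_path, binary_path, executable), check=True
    )
```

After `rebuild_lm("lm.arpa", "lm.binary")` succeeds, generate_package.py can be run again on the new lm.binary.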

reuben · May 16, 2020, 10:35am

tarekeldeeb:

Inspecting the failing scorer file with the head command shows a three-line git-lfs header followed by binary chunks:
version https://git-lfs.github.com/spec/v1
oid sha256:94dc681c40e7731a82e9fbd7f6..d943a1d0411
size 1581036
EIRT▒?▒▒▒▒?▒~consstandard▒Z▒▒ ...

This looks like you ran generate_package.py with an invalid LM (cloned without git-lfs being properly installed).
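
A quick way to catch this before packaging is to check whether a file starts with the git-lfs pointer line shown in the output above. The sketch below is an illustration, not part of DeepSpeech; the function name `looks_like_lfs_pointer` is hypothetical.

```python
# First line of a git-lfs pointer, as seen in the head output above.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def looks_like_lfs_pointer(path):
    """Return True if the file begins with a git-lfs pointer header."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX
```

If this returns True for lm.binary (or for a scorer built from it), the real file was never downloaded, and the LM has to be fetched with git-lfs or rebuilt.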

tarekeldeeb · May 16, 2020, 11:02am

Thanks, the header is now fixed after generating a new lm.binary and scorer. However, training still fails to start when the scorer is used. It runs normally without the scorer, but that will not produce an accurate model.

How can I get more verbose info about the failing scorer?

reuben · May 16, 2020, 11:06am

The scorer does not affect the training of the acoustic model itself; it is only used at the very last step, for the test epoch.

As for more verbose output: for now, that is only possible by editing the code or running it under a debugger, sadly.

othiele · May 16, 2020, 11:23am

Adding to @reuben: maybe different alphabet files or unknown characters? Building the language model from scratch works fine for me. Otherwise, create a very simple language model from scratch and check that it works.
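
One way to test the unknown-characters theory is to compare the vocabulary text against the alphabet file. The sketch below assumes the alphabet file lists one symbol per line, that lines starting with `#` are comments, and that `\#` stands for a literal hash; check these assumptions against your own file. The function names are hypothetical.

```python
from collections import Counter


def load_alphabet(path):
    """Read an alphabet file (assumed: one symbol per line, '#' = comment)."""
    symbols = set()
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.rstrip("\r\n")
            if line.startswith("\\#"):
                symbols.add("#")  # assumed escape for a literal '#'
            elif line.startswith("#") or line == "":
                continue  # skip comments and empty lines
            else:
                symbols.add(line)
    return symbols


def unknown_characters(text_path, alphabet):
    """Count non-whitespace characters in the text that are not in the alphabet."""
    counts = Counter()
    with open(text_path, encoding="utf-8") as handle:
        for line in handle:
            for ch in line:
                if not ch.isspace() and ch not in alphabet:
                    counts[ch] += 1
    return counts
```

For example, `unknown_characters("../my/my-vocab.txt", load_alphabet("../my/my-alphabets.txt"))` returns an empty Counter when every character in the vocabulary is covered by the alphabet.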

tarekeldeeb · May 19, 2020, 8:32pm

Thanks @othiele and @reuben. Rebuilding all the language files again seems to have resolved the problem. Training can now start.

I wish some verbose messages were printed for common errors.
