I am working on a speech-to-text engine for Urdu (Pakistan's national language). I am using DeepSpeech 0.9.1 and followed the instructions in the documentation.
But when I ran the command below:

```
!./generate_scorer_package --alphabet /content/gdrive/MyDrive/dataset/UrduAlphabet_newscrawl2.txt \
  --lm /gdrive/My\ Drive/urdu_lm/lm.binary UTF-8 model true \
  --package /gdrive/My\ Drive/urdu_lm/kenlm.scorer \
  --force_bytes_output_mode false \
  --default_alpha 0.931289039105002 \
  --default_beta 1.1834137581510284 \
  --vocab /content/gdrive/My\ Drive/urdu_lm/vocab-500000.txt
```
I got the error below, and I also can't locate my scorer file. Can anyone please help?
```
500000 unique words read from vocabulary file.
Doesn't look like a character based (Bytes Are All You Need) model.
Error loading language model file: Could not read scorer file.
```
Could it be that vocab-500000.txt is the standard English file? It would of course have to be in Urdu. Please check all the files you are using and try to understand what they do; otherwise you won't get a model you can use. generate_scorer_package basically just runs other scripts and programs; check the source to see what it does.
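As a quick sanity check (a sketch using the paths from your command; adjust them to your setup), you can peek at the first entries and count lines containing Latin letters, which should be near zero for an Urdu vocabulary:

```
# Peek at the first entries -- they should be Urdu words.
head -n 5 /gdrive/My\ Drive/urdu_lm/vocab-500000.txt

# Count lines containing Latin letters; for an Urdu vocabulary
# this should be (close to) zero.
grep -c '[A-Za-z]' /gdrive/My\ Drive/urdu_lm/vocab-500000.txt
```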
The error indicates that you are passing an unloadable lm.binary. The message is a bit misleading here, sorry:
```
("lm", po::value(), "Path of KenLM binary LM file. Must be built without including the vocabulary (use the -v flag). See generate_lm.py for how to create a binary LM.")
```
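For reference, this is roughly what data/lm/generate_lm.py does under the hood (a sketch based on the DeepSpeech 0.9.x pipeline; the exact flags and file names here are illustrative):

```
# Build an ARPA language model from the (cleaned) text corpus.
lmplz --order 5 --text corpus.txt --arpa lm.arpa --prune 0 0 1

# Convert to binary. The -v flag leaves the vocabulary out of the
# binary file, which is what generate_scorer_package expects.
build_binary -a 255 -q 8 -v trie lm.arpa lm.binary
```

If your lm.binary was built without -v, rebuilding it along these lines is the first thing to try.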
For anyone else who runs into similar issues: the source file I was using to generate the language model contained more than just the words I wanted the model to be trained on. I removed some special characters (tabs, and numbers that were not part of the sentences) and the problem was resolved. It does not make 100% sense to me, but that is what fixed it.
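In case it helps, one way to do that kind of cleanup before rebuilding the LM (a sketch; the file names are placeholders, and the character list should match whatever your corpus actually contains):

```
# Delete tabs and stray digits from the LM source text.
tr -d '\t0-9' < corpus_raw.txt > corpus_clean.txt
```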
Thank you for your help.
You should have gotten a message that there are characters in your vocab that are not in the alphabet. Characters that are not in the alphabet can't be recognized, so it makes sense not to have them in the vocabulary.
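A quick way to spot such characters (a sketch; file names are taken from this thread, and it assumes a UTF-8 locale so grep treats each Urdu character as a single match) is to diff the character sets of the two files:

```
# Every character occurring in the vocabulary, one per line, deduplicated.
grep -o . vocab-500000.txt | sort -u > vocab_chars.txt

# Same for the alphabet file.
grep -o . UrduAlphabet_newscrawl2.txt | sort -u > alphabet_chars.txt

# Characters present in the vocabulary but missing from the alphabet;
# ideally this prints nothing.
comm -23 vocab_chars.txt alphabet_chars.txt
```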