I am working on a speech-to-text engine for Urdu (Pakistan's national language). I am using DeepSpeech 0.9.1 and followed the instructions in the documentation.
But when I ran the command below:

```
!./generate_scorer_package --alphabet /content/gdrive/MyDrive/dataset/UrduAlphabet_newscrawl2.txt \
  --lm /gdrive/My\ Drive/urdu_lm/lm.binary UTF-8 model true \
  --package /gdrive/My\ Drive/urdu_lm/kenlm.scorer \
  --force_bytes_output_mode false \
  --default_alpha 0.931289039105002 \
  --default_beta 1.1834137581510284 \
  --vocab /content/gdrive/My\ Drive/urdu_lm/vocab-500000.txt
```
I got the error below, and I also can't locate my scorer file. Can anyone please help?
```
500000 unique words read from vocabulary file.
Doesn't look like a character based (Bytes Are All You Need) model.
Error loading language model file: Could not read scorer file.
```
Could it be that vocab-500000.txt is the standard English file? It would of course have to be in Urdu. Please check all the files you are using and try to understand what they do; otherwise you won't get a model you can use. generate_scorer_package basically just runs other scripts and programs; check the source to see what it does.
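As a quick sanity check (a sketch using the paths from your command; adjust them to your setup), you can peek at the first entries and count lines containing Latin letters, which should be near zero for an Urdu vocabulary:

```
# Peek at the first entries -- they should be Urdu words.
head -n 5 /gdrive/My\ Drive/urdu_lm/vocab-500000.txt

# Count lines containing Latin letters; for an Urdu vocabulary
# this should be (close to) zero.
grep -c '[A-Za-z]' /gdrive/My\ Drive/urdu_lm/vocab-500000.txt
```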
The error indicates that you are passing an unloadable lm.binary. The message is a bit misleading here, sorry:
```
("lm", po::value(), "Path of KenLM binary LM file. Must be built without including the vocabulary (use the -v flag). See generate_lm.py for how to create a binary LM.")
```
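For reference, this is roughly what data/lm/generate_lm.py does under the hood (a sketch based on the DeepSpeech 0.9.x pipeline; the exact flags and file names here are illustrative):

```
# Build an ARPA language model from the (cleaned) text corpus.
lmplz --order 5 --text corpus.txt --arpa lm.arpa --prune 0 0 1

# Convert to binary. The -v flag leaves the vocabulary out of the
# binary file, which is what generate_scorer_package expects.
build_binary -a 255 -q 8 -v trie lm.arpa lm.binary
```

If your lm.binary was built without -v, rebuilding it along these lines is the first thing to try.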
For anyone else who runs into similar issues: the source file I was using to generate the language model contained more than just the words I wanted the model to be trained on. I removed some special characters (tabs, and numbers that were not part of the sentences) and the problem was resolved. It does not make 100% sense to me, but that is what fixed it.
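In case it helps, one way to do that kind of cleanup before rebuilding the LM (a sketch; the file names are placeholders, and the character list should match whatever your corpus actually contains):

```
# Delete tabs and stray digits from the LM source text.
tr -d '\t0-9' < corpus_raw.txt > corpus_clean.txt
```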
Thank you for your help.
You should have gotten a message that there are characters in your vocab that are not in the alphabet. Characters that are not in the alphabet can't be recognized, so it makes sense not to have them in the vocabulary.
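A quick way to spot such characters (a sketch; file names are taken from this thread, and it assumes a UTF-8 locale so grep treats each Urdu character as a single match) is to diff the character sets of the two files:

```
# Every character occurring in the vocabulary, one per line, deduplicated.
grep -o . vocab-500000.txt | sort -u > vocab_chars.txt

# Same for the alphabet file.
grep -o . UrduAlphabet_newscrawl2.txt | sort -u > alphabet_chars.txt

# Characters present in the vocabulary but missing from the alphabet;
# ideally this prints nothing.
comm -23 vocab_chars.txt alphabet_chars.txt
```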