Generate_scorer_package error creating language model

i181237 · November 17, 2020, 9:58pm

Hello,

I am working on creating a speech-to-text engine for Urdu (the national language of Pakistan). I am using DeepSpeech 0.9.1 and followed the instructions in the documentation.

I am running the following command when working with the language model:
./generate_scorer_package --alphabet /content/gdrive/MyDrive/dataset/Urdualphabet.txt --lm lm.binary --package kenlm.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284 --vocab vocab-500000.txt

The output I am getting is:
500000 unique words read from vocabulary file.
Doesn't look like a character based (Bytes Are All You Need) model.
--force_bytes_output_mode was not specified, using value infered from vocabulary contents: false
Invalid label 0

Can someone please advise what I need to change? The vocabulary was generated successfully.

I tried with --force_bytes_output_mode set, but the documentation says it is for a different purpose. Is there a flag for UTF-8? I could not find anything about that in the documentation.

Thank you.

othiele · November 17, 2020, 10:16pm

Sounds a bit strange. DeepSpeech handles UTF-8 well, so there is no need for a UTF-8 flag. Are you sure your files are all right?

And you are the second person tonight asking for Urdu. Why don’t you team up?

i181237 · November 18, 2020, 12:17am

Thanks for your quick response.

I checked with the other person working on Urdu; she is working on Roman Urdu, a version of Urdu written in Roman letters. I am working with the native Urdu script, which is written right to left, and with a completely different data set. Could right-to-left be the reason?

I just pulled a sample Urdu data set from: http://pcai056.informatik.uni-leipzig.de/downloads/corpora/urd_newscrawl_2016_1M.tar.gz
I installed KenLM and its dependencies.
Then I used the command below to generate the vocabulary and lm binary file.
!python3 generate_lm.py --input_txt ../../../Urdu_Corpus/urd_newscrawl_2016_1M/urd_newscrawl_2016_1M-sentences.txt \
  --output_dir . \
  --top_k 500000 --kenlm_bins ../../../kenlm/build/bin/ \
  --arpa_order 4 --max_arpa_memory "85%" --arpa_prune "0" \
  --binary_a_bits 255 --binary_q_bits 8 --binary_type trie

I have regenerated the alphabet as well but it doesn’t seem to matter. What else can I change?

reuben · November 18, 2020, 8:39am

This error means your data contains a character that does not appear in your alphabet file. In this case, the character is 0.
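
One way to find such characters is to compare every character in the vocabulary file against the labels in the alphabet file. The following is only a rough sketch, not part of DeepSpeech: the helper names load_alphabet, missing_characters and check are made up for illustration, and it assumes the alphabet file has one label per line with lines starting with # treated as comments (and \# standing for a literal #), and that the vocabulary file is whitespace-separated UTF-8 text.

```python
from pathlib import Path


def load_alphabet(lines):
    """Collect labels from alphabet lines (assumed: one label per line, '#' = comment)."""
    labels = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("\\#"):
            # Assumed escape for a literal '#' label.
            labels.add("#")
        elif line.startswith("#") or line == "":
            # Skip comments and empty lines.
            continue
        else:
            labels.add(line)
    return labels


def missing_characters(words, labels):
    """Count characters that occur in the words but have no label in the alphabet."""
    missing = {}
    for word in words:
        for ch in word:
            if ch not in labels:
                missing[ch] = missing.get(ch, 0) + 1
    return missing


def check(alphabet_path, vocab_path):
    """Read both files as UTF-8 and report characters missing from the alphabet."""
    alphabet_lines = Path(alphabet_path).read_text(encoding="utf-8").split("\n")
    words = Path(vocab_path).read_text(encoding="utf-8").split()
    return missing_characters(words, load_alphabet(alphabet_lines))
```

With the files from this thread, calling check("/content/gdrive/MyDrive/dataset/Urdualphabet.txt", "vocab-500000.txt") would return a dictionary mapping each unlisted character to the number of times it occurs. Any digit such as 0 that shows up there either needs to be added to the alphabet or cleaned out of the corpus before the vocabulary is generated.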

bitbarrel · September 17, 2021, 5:43am

You will also get this error if the path to the alphabet file is incorrect.
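
A small check like the one below can rule out a bad path before running generate_scorer_package. It is only a sketch under the assumption that the alphabet is a plain UTF-8 text file; the function name alphabet_problem is hypothetical and not part of DeepSpeech.

```python
from pathlib import Path


def alphabet_problem(path):
    """Return a short description of what is wrong with the path, or None if it looks usable."""
    p = Path(path)
    if not p.exists():
        return f"{p} does not exist"
    if not p.is_file():
        return f"{p} is not a regular file"
    try:
        text = p.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        return f"{p} is not valid UTF-8"
    if not text.strip("\r\n"):
        return f"{p} is empty"
    return None
```

For example, alphabet_problem("/content/gdrive/MyDrive/dataset/Urdualphabet.txt") returns None when the file is present and readable, and a message otherwise. On Colab, a missing file often just means the Drive is not mounted in the current session.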