I am working on a speech-to-text engine for Urdu (Pakistan's national language). I am using DeepSpeech 0.9.1 and followed the instructions in the documentation.
I am running the following command when working with the language model:
    ./generate_scorer_package --alphabet /content/gdrive/MyDrive/dataset/Urdualphabet.txt \
      --lm lm.binary --package kenlm.scorer \
      --default_alpha 0.931289039105002 --default_beta 1.1834137581510284 \
      --vocab vocab-500000.txt
The output I am getting is:
    500000 unique words read from vocabulary file.
    Doesn't look like a character based (Bytes Are All You Need) model.
    --force_bytes_output_mode was not specified, using value infered from vocabulary contents: false
    Invalid label 0
Can someone please advise what I need to change? The vocabulary was generated successfully.
I tried with --force_bytes_output_mode set, but the documentation says that flag is for a different purpose. Is there a flag for UTF-8? I could not find anything about that in the documentation.
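Since I could not find a UTF-8 flag, one thing I can at least verify is that the alphabet and vocabulary files really are plain UTF-8. The snippet below is only a minimal sketch of such a check; `check_utf8` is a hypothetical helper name of my own, not part of DeepSpeech, and I am only assuming that a byte-order mark or invalid bytes could confuse the tools.

```python
import codecs
import sys


def check_utf8(data: bytes) -> dict:
    """Report whether raw bytes decode as UTF-8 and whether they start with a UTF-8 BOM."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        data.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        valid = False
    return {"valid_utf8": valid, "has_bom": has_bom}


if __name__ == "__main__":
    # Usage: pass the file paths to check on the command line,
    # e.g. the alphabet file and the generated vocabulary file.
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            print(path, check_utf8(fh.read()))
```

It takes file paths on the command line and prints one result per file, so it can be pointed at both the alphabet file and vocab-500000.txt.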
I looked at the other person working on Urdu, but she is working on Roman Urdu, a version of Urdu written with Roman letters. I am working with the native Urdu script, which is written right to left and has a completely different data set. Could the right-to-left script be the reason?
I just pulled a sample Urdu data set from: http://pcai056.informatik.uni-leipzig.de/downloads/corpora/urd_newscrawl_2016_1M.tar.gz
I installed KenLM and its dependencies.
Then I used the command below to generate the vocabulary and lm binary file.
    !python3 generate_lm.py --input_txt ../../../Urdu_Corpus/urd_newscrawl_2016_1M/urd_newscrawl_2016_1M-sentences.txt \
      --output_dir . \
      --top_k 500000 --kenlm_bins ../../../kenlm/build/bin/ \
      --arpa_order 4 --max_arpa_memory "85%" --arpa_prune "0" \
      --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
I have regenerated the alphabet as well, but it doesn’t seem to make a difference. What else can I change?
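One sanity check that might help narrow this down is comparing the characters used in the vocabulary against the labels in the alphabet file. The snippet below is a minimal sketch, not a confirmed diagnosis of the "Invalid label 0" error. The helper names `load_alphabet` and `missing_characters` are hypothetical. It assumes the alphabet file has one label per line, with lines starting with # treated as comments and \# standing for a literal #. Skipping empty lines is my own assumption.

```python
import sys
from collections import Counter


def load_alphabet(lines):
    """Parse alphabet lines into a list of labels (assumed format: one label per line)."""
    labels = []
    for line in lines:
        line = line.rstrip("\n").rstrip("\r")
        if line.startswith("\\#"):
            labels.append("#")      # escaped hash is a real label
        elif line.startswith("#"):
            continue                # comment line
        elif line != "":
            labels.append(line)     # a line holding a single space is the space label
    return labels


def missing_characters(vocab_words, labels):
    """Count characters that occur in the vocabulary but have no label in the alphabet."""
    known = set(labels)
    missing = Counter()
    for word in vocab_words:
        for ch in word:
            if ch not in known:
                missing[ch] += 1
    return missing


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: script.py <alphabet file> <vocabulary file>
    alphabet_path, vocab_path = sys.argv[1], sys.argv[2]
    with open(alphabet_path, encoding="utf-8") as fh:
        labels = load_alphabet(fh)
    with open(vocab_path, encoding="utf-8") as fh:
        words = fh.read().split()
    print(len(labels), "labels loaded")
    for ch, count in missing_characters(words, labels).most_common():
        print(repr(ch), f"U+{ord(ch):04X}", count)
```

If this prints zero labels, or lists characters from the vocabulary that are missing from the alphabet, that would at least point to the alphabet file rather than the right-to-left script itself.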