Hello everyone!
I’m trying to add more sentences to the language model. To do this, I downloaded a wiki dump, then preprocessed and cleaned it so that it has one sentence per line, and removed all sentences containing numbers.
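For reference, the cleaning step described above can be sketched roughly like this. It is only a minimal illustration, not the actual script that was used: the function names are hypothetical, and the sentence splitter is a naive regex that assumes sentences end in ".", "!" or "?" followed by whitespace.

```python
import re

# Hypothetical helpers for illustration only.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")  # split after ., ! or ? plus whitespace
_HAS_DIGIT = re.compile(r"\d")


def split_sentences(text):
    """Naively split text into sentences and drop empty pieces."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def clean_corpus(text):
    """Return one sentence per item, skipping any sentence that contains a digit."""
    return [s for s in split_sentences(text) if not _HAS_DIGIT.search(s)]


sample = "The cat sat. It was born in 2015. Dogs bark!"
for sentence in clean_corpus(sample):
    print(sentence)  # one sentence per line, as the language model input expects
```

A real wiki dump would likely also need lowercasing and removal of characters that are not in alphabet.txt, which this sketch does not attempt.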
I am doing this because, in evaluation, my WER is noticeably higher than my CER.
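To make the WER/CER gap concrete, here is a small sketch of how the two metrics are commonly computed from edit distance. This is an assumption about the usual definitions, not DeepSpeech's own evaluation code, and the function names are made up for the example. A single wrong character counts as one whole wrong word, so WER can be much larger than CER.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


reference = "the quick brown fox"
hypothesis = "the quik brown fox"  # one missing character
print("WER:", wer(reference, hypothesis))
print("CER:", cer(reference, hypothesis))
```

Here one dropped letter gives a WER of 1/4 but a CER of only 1/19, which is the kind of gap a language model with better vocabulary coverage may help close.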
Here are some details:
DeepSpeech 0.7.4
Ubuntu 20.04
Python 3.7
tensorflow 1.15
Below is the command I used to create the binaries, which were created successfully:
python3 generate_lm.py --input_txt /media/kamla/data/Voice_dataset/language_model_stuff/corpus/librispeech.txt --output_dir /media/kamla/847636CE7636C0AA/Users/offic/Documents/Kenlm --top_k 50000 --kenlm_bins /media/kamla/data/Voice_dataset/language_model_stuff/kenlm/build/bin --arpa_order 5 --max_arpa_memory "85%" --arpa_prune "0|0|1" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie
Now, when I try to use generate_package.py as shown below:
python3 generate_package.py --alphabet ../alphabet.txt --lm lm.binary --vocab vocab-50000.txt --package kenlm.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284
I get the following error:
50000 unique words read from vocabulary file.
Doesn't look like a character based model.
Using detected UTF-8 mode: False
Traceback (most recent call last):
  File "generate_package.py", line 157, in <module>
    main()
  File "generate_package.py", line 152, in main
    args.default_beta,
  File "generate_package.py", line 48, in create_bundle
    alphabet = NativeAlphabet()
TypeError: __init__() missing 1 required positional argument: 'config_path'