Generating a language model for a small vocabulary

Hello,
I want to generate a language model for only ~30 words.

Is that OK to do if I want my model to recognize only these words?

Second, when I follow the 0.8.2 model building guide with this file, I get empty output for a while and then Segmentation fault: 1. I use a streaming model with a custom scorer; if I use the pretrained scorer, everything works fine.

Here is how I generate the scorer:

python generate_lm.py --input_txt lm.txt --output_dir lm --top_k 30 --kenlm_bins /Users/kirill/Developer/kenlm/build/bin --arpa_order 2 --max_arpa_memory "85%" --arpa_prune "0" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback
./native_client.amd64.cpu.osx/generate_scorer_package --alphabet ../lm/alphabet.txt --lm lm/lm.binary --vocab lm/vocab-40.txt --package /Users/kirill/Developer/kenlm.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284

It should be fine.

Hard to act on without more context here …

Have you verified the output file? What is its size?

Is the alphabet the same everywhere?

Can we see the output of your generation steps?

This is something that works well in many setups, so it's likely a mistake on your side.
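The first two checks above can be scripted. What follows is only a rough sketch, not part of any DeepSpeech tooling: the helper names file_report and same_content are made up for illustration, and the paths are copied from the commands in the first post, so adjust them to your layout. It prints the size of each file (or flags it as missing) and can compare two alphabet files byte for byte.

```python
import hashlib
from pathlib import Path


def file_report(paths):
    """Map each path to its size in bytes, or None if the file is missing."""
    report = {}
    for p in paths:
        p = Path(p)
        report[str(p)] = p.stat().st_size if p.is_file() else None
    return report


def same_content(path_a, path_b):
    """Return True only if both files exist and are byte-identical."""
    a, b = Path(path_a), Path(path_b)
    if not (a.is_file() and b.is_file()):
        return False
    return (hashlib.sha256(a.read_bytes()).hexdigest()
            == hashlib.sha256(b.read_bytes()).hexdigest())


if __name__ == "__main__":
    # Paths taken from the commands in the first post (assumed layout).
    checked = ["../lm/alphabet.txt", "lm/lm.binary", "lm/vocab-40.txt"]
    for path, size in file_report(checked).items():
        print(path, "MISSING" if size is None else f"{size} bytes")
```

same_content is meant for comparing the alphabet used for training with the one passed to --alphabet, wherever those two copies live on your machine.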

Thanks for the help. It turns out I was pointing to a vocabulary file that didn't exist.
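For anyone who hits the same thing, a small pre-flight check would have caught this before the segfault. This is only a sketch: missing_inputs and INPUT_FLAGS are names invented here, and the flag list simply mirrors the generate_scorer_package command in the first post. As far as I can tell, generate_lm.py names the vocabulary file after --top_k, so with --top_k 30 the file would be vocab-30.txt rather than vocab-40.txt, but treat that as an assumption and check your output directory.

```python
import os

# Flags from the generate_scorer_package command above whose values
# must be existing input files (assumed from that command only).
INPUT_FLAGS = ("--alphabet", "--lm", "--vocab")


def missing_inputs(argv):
    """Return (flag, path) pairs whose path is not an existing file."""
    missing = []
    # Walk the argument list as (current, next) pairs so that each
    # flag is seen together with the value that follows it.
    for flag, value in zip(argv, argv[1:]):
        if flag in INPUT_FLAGS and not os.path.isfile(value):
            missing.append((flag, value))
    return missing


if __name__ == "__main__":
    cmd = [
        "./native_client.amd64.cpu.osx/generate_scorer_package",
        "--alphabet", "../lm/alphabet.txt",
        "--lm", "lm/lm.binary",
        "--vocab", "lm/vocab-40.txt",
        "--package", "/Users/kirill/Developer/kenlm.scorer",
    ]
    for flag, path in missing_inputs(cmd):
        print(f"{flag}: file not found: {path}")
```

The --package value is deliberately not checked, since that is the output file and is not expected to exist yet.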
