Generating language model for a small vocabulary

Kirill · August 28, 2020, 4:14pm

Hello,
I want to generate a language model for only ~30 words:

https://gist.github.com/Zhdanovich/69b68f66b0a7dc105e830e480a9df188

gistfile1.txt

start
repeat
undo
save
one
two
three
four
five
six

This file has been truncated.

Is this OK to do if I want my model to recognize only these words?

Second, when I follow the 0.8.2 model-building guide with this file, I end up with empty output for a while and then Segmentation fault: 1. I use a streaming model with a custom scorer; if I use the pretrained scorer, everything works fine.

Here is how I generate scorer:

python generate_lm.py --input_txt lm.txt --output_dir lm --top_k 30 --kenlm_bins /Users/kirill/Developer/kenlm/build/bin --arpa_order 2 --max_arpa_memory "85%" --arpa_prune "0" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback
./native_client.amd64.cpu.osx/generate_scorer_package --alphabet ../lm/alphabet.txt --lm lm/lm.binary --vocab lm/vocab-40.txt --package /Users/kirill/Developer/kenlm.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284

lissyx · August 28, 2020, 4:43pm

It should be fine.

Hard to act on without more context here …

Have you verified the output file? Its size?

Is the alphabet the same everywhere?

Can we see the output of your generation steps?

This is something that works well in many places, so it’s likely a mistake on your side.
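
For the alphabet question, a minimal sketch of one way to check it: it reports characters in the LM input text that are not in the alphabet file. It assumes the alphabet file has one label per line, with lines starting with # treated as comments and \# standing for a literal hash; the function names are made up for illustration and are not part of DeepSpeech.

```python
def load_alphabet(path):
    """Read alphabet labels, assuming one label per line and '#' comments."""
    labels = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\r\n")  # keep a lone space, it can be a label
            if line.startswith("#"):
                continue  # assumed comment line
            if line == "\\#":
                line = "#"  # assumed escape for a literal hash
            if line:
                labels.add(line)
    return labels


def chars_outside_alphabet(text_path, alphabet):
    """Return the set of characters in the text file that the alphabet lacks."""
    missing = set()
    with open(text_path, encoding="utf-8") as f:
        for line in f:
            missing.update(set(line.rstrip("\r\n")) - alphabet)
    return missing


# Example call, using the file names from the commands in this thread:
# print(chars_outside_alphabet("lm.txt", load_alphabet("../lm/alphabet.txt")))
```

An empty result means every character in the text is covered by that alphabet file. It says nothing about whether the same alphabet was used to train the acoustic model, which still has to be checked by hand.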

Kirill · August 31, 2020, 12:55pm

Thanks for the help. It turns out I was pointing to a vocabulary file that didn’t exist.
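
For anyone hitting the same thing, a small pre-flight check along these lines would have caught it. This is only a sketch: the helper name is hypothetical, and it simply tests that each input passed to generate_scorer_package exists and is not empty. As far as I can tell, generate_lm.py names the vocabulary file after --top_k, so --top_k 30 would produce vocab-30.txt rather than the vocab-40.txt used above; treat that as an assumption and check the output directory.

```python
import os
import sys


def find_bad_inputs(paths):
    """Return (path, reason) pairs for inputs that are missing or empty."""
    problems = []
    for path in paths:
        if not os.path.isfile(path):
            problems.append((path, "does not exist"))
        elif os.path.getsize(path) == 0:
            problems.append((path, "is empty"))
    return problems


if __name__ == "__main__":
    # Pass the files to check as arguments, for example:
    #   python check_inputs.py ../lm/alphabet.txt lm/lm.binary lm/vocab-40.txt
    for path, reason in find_bad_inputs(sys.argv[1:]):
        print(f"{path}: {reason}")
```

Running it on the three inputs before building the scorer package prints one line per problem file and nothing if all of them look usable.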