Hello @nmstoker,
I want to create my own scorer file. When executing the generate_lm.py script, I have this output:
CBuilding lm.binary …
./data/lm/lm.binary
Usage: ./kenlm/build/bin/build_binary [-u log10_unknown_probability] [-s] [-i] [-v] [-w mmap|after] [-p probing_multiplier] [-T trie_temporary] [-S trie_building_mem] [-q bits] [-b bits] [-a bits] [type] input.arpa [output.mmap]
-u sets the log10 probability for if the ARPA file does not have one.
Default is -100. The ARPA file will always take precedence.
-s allows models to be built even if they do not have and .
-i allows buggy models from IRSTLM by mapping positive log probability to 0.
-v disables inclusion of the vocabulary in the binary file.
-w mmap|after determines how writing is done.
mmap maps the binary file and writes to it. Default for trie.
after allocates anonymous memory, builds, and writes. Default for probing.
-r “order1.arpa order2 order3 order4” adds lower-order rest costs from these
model files. order1.arpa must be an ARPA file. All others may be ARPA or
the same data structure as being built. All files must have the same
vocabulary. For probing, the unigrams must be in the same order.
type is either probing or trie. Default is probing.
probing uses a probing hash table. It is the fastest but uses the most memory.
-p sets the space multiplier and must be >1.0. The default is 1.5.
trie is a straightforward trie with bit-level packing. It uses the least
memory and is still faster than SRI or IRST. Building the trie format uses an
on-disk sort to save memory.
-T is the temporary directory prefix. Default is the output file name.
-S determines memory use for sorting. Default is 80%. This is compatible
with GNU sort. The number is followed by a unit: % for percent of physical
memory, b for bytes, K for Kilobytes, M for megabytes, then G,T,P,E,Z,Y.
Default unit is K for Kilobytes.
-q turns quantization on and sets the number of bits (e.g. -q 8).
-b sets backoff quantization bits. Requires -q and defaults to that value.
-a compresses pointers using an array of offsets. The parameter is the
maximum number of bits encoded by the array. Memory is minimized subject
to the maximum, so pick 255 to minimize memory.
-h print this help message.
Get a memory estimate by passing an ARPA file without an output file name.
Traceback (most recent call last):
File “./data/lm/generate_lm.py”, line 211, in
main()
File “./data/lm/generate_lm.py”, line 202, in main
build_lm(args, data_lower, vocab_str)
File “./data/lm/generate_lm.py”, line 127, in build_lm
“./data/lm/lm.binary”,
File “/usr/lib/python3.6/subprocess.py”, line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command ‘[’./kenlm/build/bin/build_binary’, ‘-a’, ‘255’, ‘-q’, ‘8’, ‘-v’, ‘tree’, ‘./data/lm/lm_filtered.arpa’, ‘./data/lm/lm.binary’]’ returned non-zero exit status 1.
My lm.arpa is created succesfully but it crashed when creating lm.binary. Do you have any idea about ?
Thank’s for your help