Hi there! I created my first scorer package today, using the excellent generate_lm.py and generate_package.py scripts provided in v0.7.0. However, I get the following output:
$ python generate_package.py --alphabet ../alphabet.txt --lm lm.binary --vocab vocab-50000.txt --package kenlm.scorer --default_alpha 1.234 --default_beta 1.012
73 unique words read from vocabulary file.
Doesn't look like a character based model.
Using detected UTF-8 mode: False
Package created in kenlm.scorer
swig/python detected a memory leak of type 'Alphabet *', no destructor found.
I generated the other files using:
$ python generate_lm.py --input_txt ~/path/to/in.transcript --output_dir . --kenlm_bins ~/path/to/kenlm/build/bin --arpa_order 4 --max_arpa_memory "90%" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --top_k 50000 --arpa_prune "0" --discount_fallback
Converting to lowercase and counting word occurrences ...
| | # | 200451 Elapsed Time: 0:00:01
Saving top 50000 words ...
Calculating word statistics ...
Your text file has 919303 words in total
It has 73 unique words
Your top-50000 words are 100.0000 percent of all words
Your most common word "takes" occurred 66816 times
The least common word in your top-k is "abort" with 1 times
The first word with 2 occurrences is "game" at place 69
Creating ARPA file ...
=== 1/5 Counting and sorting n-grams ===
Reading /home/gt/otherrepos/DeepSpeech/data/lm/lower.txt.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 919303 types 76
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:912 2:1264540416 3:2371013376 4:3793621504
Substituting fallback discounts for order 0: D1=0.5 D2=1 D3+=1.5
Substituting fallback discounts for order 1: D1=0.5 D2=1 D3+=1.5
Substituting fallback discounts for order 2: D1=0.5 D2=1 D3+=1.5
Substituting fallback discounts for order 3: D1=0.5 D2=1 D3+=1.5
Statistics:
1 76 D1=0.5 D2=1 D3+=1.5
2 1263 D1=0.5 D2=1 D3+=1.5
3 19067 D1=0.5 D2=1 D3+=1.5
4 114691 D1=0.5 D2=1 D3+=1.5
Memory estimate for binary LM:
type kB
probing 2494 assuming -p 1.5
probing 2613 assuming -r models -p 1.5
trie 749 without quantization
trie 315 assuming -q 8 -b 8 quantization
trie 731 assuming -a 22 array pointer compression
trie 297 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:912 2:20208 3:381340 4:2752584
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:912 2:20208 3:381340 4:2752584
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz VmPeak:7446612 kB VmRSS:9560 kB RSSMax:1455416 kB user:0.372354 sys:0.416396 CPU:0.788805 real:0.817813
Filtering ARPA file using vocabulary of top-k words ...
Reading ./lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Building lm.binary ...
Reading ./lm_filtered.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Quantizing
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
Please let me know if you require more information. I generated in.transcript with a Python script that basically prints out all the possible phrase variations using several nested for loops. Running file in.transcript reports "ASCII text".
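The generator is something along these lines (a simplified sketch with placeholder word lists, not the actual script):

# Simplified sketch of the transcript generator (placeholder word lists, not the real ones).
subjects = ["white", "black"]
verbs = ["takes", "moves"]
squares = [f"{col}{row}" for col in "abcdefgh" for row in range(1, 9)]

with open("in.transcript", "w", encoding="ascii") as out:
    for subject in subjects:
        for verb in verbs:
            for square in squares:
                # One lowercase ASCII phrase per line.
                out.write(f"{subject} {verb} {square}\n")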
I was following the Scorer guide. My vocabulary contains English words with English characters, so they should fit in UTF-8 encoding. So why does the output say the following?
Doesn't look like a character based model.
Using detected UTF-8 mode: False
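From skimming generate_package.py, my (possibly wrong) understanding is that the UTF-8 / character-based detection is a heuristic over the vocabulary rather than a check of the text encoding, roughly like this (paraphrased sketch, not the exact code):

# Paraphrased guess at the detection in generate_package.py (may not match the real code):
# the model only "looks character based" if every vocabulary entry is a single character.
with open("vocab-50000.txt", encoding="utf-8") as fin:
    words = fin.read().split()

vocab_looks_char_based = all(len(word) == 1 for word in words)
print(("Looks" if vocab_looks_char_based else "Doesn't look")
      + " like a character based model.")
print("Using detected UTF-8 mode:", vocab_looks_char_based)

If that reading is right, then with 73 multi-character English words the script would be expected to report False here, but I would like to confirm whether that is the intended behaviour for a word-based English scorer, or whether it indicates a problem with my files.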