"Doesn't look like a character based model"

Hi there! I created my first scorer package today, using the excellent generate_lm.py and generate_package.py scripts provided in v0.7.0. However, I get the following message:

$ python generate_package.py --alphabet ../alphabet.txt --lm lm.binary --vocab vocab-50000.txt --package kenlm.scorer --default_alpha 1.234 --default_beta 1.012

73 unique words read from vocabulary file.
Doesn't look like a character based model.
Using detected UTF-8 mode: False
Package created in kenlm.scorer
swig/python detected a memory leak of type 'Alphabet *', no destructor found.

I generated the other files using:

$ python generate_lm.py --input_txt ~/path/to/in.transcript --output_dir . --kenlm_bins ~/path/to/kenlm/build/bin --arpa_order 4 --max_arpa_memory "90%" --binary_a_bits 255 --binary_q_bits 8 --binary_type trie --top_k 50000 --arpa_prune "0" --discount_fallback

Converting to lowercase and counting word occurrences ...
| |                   #                                                             | 200451 Elapsed Time: 0:00:01

Saving top 50000 words ...

Calculating word statistics ...
  Your text file has 919303 words in total
  It has 73 unique words
  Your top-50000 words are 100.0000 percent of all words
  Your most common word "takes" occurred 66816 times
  The least common word in your top-k is "abort" with 1 times
  The first word with 2 occurrences is "game" at place 69

Creating ARPA file ...
=== 1/5 Counting and sorting n-grams ===
Reading /home/gt/otherrepos/DeepSpeech/data/lm/lower.txt.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 919303 types 76
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:912 2:1264540416 3:2371013376 4:3793621504
Substituting fallback discounts for order 0: D1=0.5 D2=1 D3+=1.5
Substituting fallback discounts for order 1: D1=0.5 D2=1 D3+=1.5
Substituting fallback discounts for order 2: D1=0.5 D2=1 D3+=1.5
Substituting fallback discounts for order 3: D1=0.5 D2=1 D3+=1.5
Statistics:
1 76 D1=0.5 D2=1 D3+=1.5
2 1263 D1=0.5 D2=1 D3+=1.5
3 19067 D1=0.5 D2=1 D3+=1.5
4 114691 D1=0.5 D2=1 D3+=1.5
Memory estimate for binary LM:
type      kB
probing 2494 assuming -p 1.5
probing 2613 assuming -r models -p 1.5
trie     749 without quantization
trie     315 assuming -q 8 -b 8 quantization 
trie     731 assuming -a 22 array pointer compression
trie     297 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:912 2:20208 3:381340 4:2752584
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:912 2:20208 3:381340 4:2752584
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz	VmPeak:7446612 kB	VmRSS:9560 kB	RSSMax:1455416 kB	user:0.372354	sys:0.416396	CPU:0.788805	real:0.817813

Filtering ARPA file using vocabulary of top-k words ...
Reading ./lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************

Building lm.binary ...
Reading ./lm_filtered.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Quantizing
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
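
As a quick sanity check of the resulting lm.binary, I can query it with the KenLM Python bindings. This snippet is my own and not part of generate_lm.py; it assumes the kenlm module is installed, and the words come from the statistics above:

import kenlm

# Load the binary LM produced by generate_lm.py.
model = kenlm.Model("lm.binary")

# Log10 probability of an in-domain sentence (bos/eos add
# sentence-boundary context).
sentence = "takes game abort"
print(model.score(sentence, bos=True, eos=True))

# Perplexity gives a rough feel for how well the LM fits the text.
print(model.perplexity(sentence))

An in-domain sentence should score noticeably higher (less negative) than a random out-of-domain one.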

Please let me know if you require more information. I generated in.transcript with a Python script that basically printed out all the possible variations using several nested for loops; running the file command on in.transcript reports plain ASCII text.
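
For reference, the generator looked roughly like this. This is a minimal sketch with made-up word lists, where itertools.product stands in for the nested for loops:

import itertools

# Hypothetical word lists; the real script used my own fragments.
first_words = ["takes", "game", "abort"]
second_words = ["one", "two", "three"]

# Write every combination, one phrase per line, as plain ASCII.
with open("in.transcript", "w", encoding="ascii") as fout:
    for first, second in itertools.product(first_words, second_words):
        fout.write(f"{first} {second}\n")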

I was following the Scorer guide. My vocabulary contains English words with English characters, which should fit in UTF-8 encoding. So why does the output say the following?

Doesn't look like a character based model.
Using detected UTF-8 mode: False

Because it’s about something else: UTF-8 mode is for when you are not using an alphabet. @reuben can elaborate, but your output looks fine.
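
If it helps, my understanding of the check in generate_package.py is a simple heuristic: if every entry in the vocabulary file is a single character, the vocabulary looks character based and UTF-8 mode is enabled; otherwise it is not. A minimal sketch of that logic (my reading, not a verbatim copy of the script):

def looks_char_based(vocab_path):
    # Heuristic: a vocabulary made entirely of single characters
    # suggests a character based model, so UTF-8 mode should be used.
    words = set()
    char_based = True
    with open(vocab_path, encoding="utf-8") as fin:
        for line in fin:
            for word in line.split():
                words.add(word)
                if len(word) > 1:
                    char_based = False
    label = "Looks" if char_based else "Doesn't look"
    print(f"{len(words)} unique words read from vocabulary file.")
    print(f"{label} like a character based model.")
    return char_based

With 73 multi-character English words, char_based comes out False, which is exactly the message you saw, so nothing is wrong.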

Just read the docs; they explain what the UTF-8 decoder mode is: https://deepspeech.readthedocs.io/en/master/Decoder.html

Thanks @reuben and @lissyx!