I noticed that many words in my LM.arpa have identical or nearly identical values. What should I do to make them differ more from each other? Below is part of the data:
\1-grams:
-2.377632 <unk> 0
0 <s> -0.048242666
-1.2715119 </s> 0
-2.161523 記 -0.04717682
-2.2475286 得 -0.04957858
-2.2475286 揳 -0.04957858
-2.2475286 返 -0.04957858
-2.2475286 件 -0.04957858
-2.2475286 衫 -0.04957858
\2-grams:
-0.96653664 記 得 -0.12493875
-1.2292416 得 揳 0
-0.9471938 揳 返 0
-0.9471938 返 件 0
-0.9471938 件 衫 0
\3-grams:
-0.23821242 練 習 </s> 0
-0.23821242 浪 狗 </s> 0
-0.23821242 小 巴 </s> 0
-0.28311574 友 唔 記 -0.30103
-0.2564864 唔 記 得 -0.30103
\4-grams:
-0.102974355 跑 練 習 </s> 0
-0.102974355 流 浪 狗 </s> 0
-0.102974355 緊 小 巴 </s> 0
-0.11888485 老 友 唔 記 -0.30103
-0.10957761 友 唔 記 得 -0.30103
\5-grams:
-0.048442308 步 跑 練 習 </s>
-0.048442308 隻 流 浪 狗 </s>
-0.048442308 等 緊 小 巴 </s>
-0.055387095 <s> 老 友 唔 記
Is this caused by too little data, or did I miss a step?
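To put a number on how similar the values are, one option is to count the distinct log10 probabilities in each n-gram section of the ARPA file. The following is only a minimal sketch: the function name `distinct_logprobs` is made up for illustration, and it assumes the standard ARPA layout where each entry line starts with the log10 probability followed by whitespace-separated tokens. The sample text is copied from the excerpt above.

```python
from collections import defaultdict


def distinct_logprobs(arpa_lines):
    """Return {order: (number of entries, number of distinct log10 probs)}.

    Hypothetical helper: assumes each entry line starts with the log10
    probability, as in the standard ARPA text format.
    """
    stats = defaultdict(lambda: {"entries": 0, "values": set()})
    order = None
    for raw in arpa_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("\\"):
            if line.endswith("-grams:"):
                # Section header such as "\1-grams:" -> order 1
                order = int(line[1:line.index("-")])
            else:
                # "\data\" or "\end\": not inside an n-gram section
                order = None
            continue
        if order is None:
            continue  # e.g. the "ngram 1=..." count lines after \data\
        try:
            logprob = float(line.split()[0])
        except ValueError:
            continue  # skip anything that is not an entry line
        stats[order]["entries"] += 1
        stats[order]["values"].add(logprob)
    return {n: (s["entries"], len(s["values"])) for n, s in stats.items()}


# Sample copied from the 1-gram and 2-gram sections shown above.
sample = r"""
\1-grams:
-2.377632 <unk> 0
0 <s> -0.048242666
-1.2715119 </s> 0
-2.161523 記 -0.04717682
-2.2475286 得 -0.04957858
-2.2475286 揳 -0.04957858
-2.2475286 返 -0.04957858
-2.2475286 件 -0.04957858
-2.2475286 衫 -0.04957858
\2-grams:
-0.96653664 記 得 -0.12493875
-1.2292416 得 揳 0
-0.9471938 揳 返 0
-0.9471938 返 件 0
-0.9471938 件 衫 0
"""

result = distinct_logprobs(sample.splitlines())
print(result)  # {1: (9, 5), 2: (5, 3)}
```

Running the same function over the lines of the full lm.arpa (opened with `encoding="utf-8"`) would show whether this pattern holds for the whole model or only for the excerpt.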
python3 ./data/lm/generate_lm.py \
--input_txt /mnt/deepspeechdata/simple/CV/zh-HK/vocabulary.txt \
--output_dir /mnt/deepspeechdata/simple/lm/ \
--top_k 10000 \
--kenlm_bins /DeepSpeech/native_client/kenlm/build/bin/ \
--arpa_order 5 \
--max_arpa_memory "90%" \
--arpa_prune "0|0|1" \
--binary_a_bits 255 \
--binary_q_bits 8 \
--binary_type trie \
--discount_fallback
python3 ./data/lm/generate_package.py \
--alphabet /mnt/deepspeechdata/simple/CV/zh-HK/alphabet.txt \
--lm /mnt/deepspeechdata/simple/lm/lm.binary \
--vocab /mnt/deepspeechdata/simple/lm/vocab-10000.txt \
--package /mnt/deepspeechdata/simple/lm/kenlm.scorer \
--force_utf8 True \
--default_alpha 0.931289039105002 \
--default_beta 1.1834137581510284
I used these two commands to generate the lm.binary and kenlm.scorer files.
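Since the "too little data" question comes down to how often each token occurs in the input text, a rough check is to build a histogram of token counts. The sketch below is an assumption-laden illustration, not part of generate_lm.py: `count_histogram` is a hypothetical helper, it assumes the input has one sentence per line with space-separated tokens, and the two toy sentences are made up from tokens in the excerpt above. If almost every token falls into the same count bucket, near-identical unigram values would not be surprising (with the caveat that KenLM's smoothing works from adjusted counts, so raw counts are only an approximation).

```python
from collections import Counter


def count_histogram(sentences):
    """Return {occurrence count: number of distinct tokens with that count}.

    Hypothetical helper: assumes tokens are separated by whitespace.
    """
    token_counts = Counter()
    for sentence in sentences:
        token_counts.update(sentence.split())
    # Invert: how many distinct tokens share each occurrence count.
    return dict(Counter(token_counts.values()))


# Toy input for illustration only, not the real training text.
toy = ["記 得 揳 返 件 衫", "老 友 唔 記 得"]
hist = count_histogram(toy)
print(hist)  # {2: 2, 1: 7}
```

To apply it to the real corpus, pass the lines of the file given as `--input_txt` (opened with `encoding="utf-8"`) instead of `toy`.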