How can I make the word probabilities in LM.arpa more accurate?

I found that the values of the words in LM.arpa are very similar to each other. What should I do to make them more distinct from each other? Below is some partial data:

\1-grams:
-2.377632	<unk>	0
0	<s>	-0.048242666
-1.2715119	</s>	0
-2.161523	記	-0.04717682
-2.2475286	得	-0.04957858
-2.2475286	揳	-0.04957858
-2.2475286	返	-0.04957858
-2.2475286	件	-0.04957858
-2.2475286	衫	-0.04957858

\2-grams:
-0.96653664	記 得	-0.12493875
-1.2292416	得 揳	0
-0.9471938	揳 返	0
-0.9471938	返 件	0
-0.9471938	件 衫	0

\3-grams:
-0.23821242	練 習 </s>	0
-0.23821242	浪 狗 </s>	0
-0.23821242	小 巴 </s>	0
-0.28311574	友 唔 記	-0.30103
-0.2564864	唔 記 得	-0.30103

\4-grams:
-0.102974355	跑 練 習 </s>	0
-0.102974355	流 浪 狗 </s>	0
-0.102974355	緊 小 巴 </s>	0
-0.11888485	老 友 唔 記	-0.30103
-0.10957761	友 唔 記 得	-0.30103

\5-grams:
-0.048442308	步 跑 練 習 </s>
-0.048442308	隻 流 浪 狗 </s>
-0.048442308	等 緊 小 巴 </s>
-0.055387095	<s> 老 友 唔 記
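
For reference, those values are log10 probabilities, so words that appear with the same count in the training text end up with nearly identical scores. You can inspect them directly with KenLM's Python module (a minimal sketch; the sentence is taken from the data above and the path assumes the lm.binary built by the commands below):

import kenlm

model = kenlm.Model('/mnt/deepspeechdata/simple/lm/lm.binary')

# Total log10 probability of the sentence, with <s> and </s> added.
print(model.score('唔 記 得', bos=True, eos=True))

# Per-token breakdown: (log10 prob, n-gram length used, is out-of-vocabulary).
for prob, ngram_len, oov in model.full_scores('唔 記 得'):
    print(prob, ngram_len, oov)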

Is it caused by having too little data? Or did I miss some steps?

python3 ./data/lm/generate_lm.py \
  --input_txt /mnt/deepspeechdata/simple/CV/zh-HK/vocabulary.txt \
  --output_dir /mnt/deepspeechdata/simple/lm/ \
  --top_k 10000 \
  --kenlm_bins /DeepSpeech/native_client/kenlm/build/bin/ \
  --arpa_order 5 \
  --max_arpa_memory "90%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie \
  --discount_fallback
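
For what it's worth, generate_lm.py essentially wraps two KenLM binaries, lmplz (estimate the ARPA file) and build_binary (compress it), plus a vocabulary filter step I'm omitting here. A rough sketch of the equivalent calls with your flags (the intermediate file names are my guesses, not necessarily the script's exact ones):

import subprocess

KENLM = '/DeepSpeech/native_client/kenlm/build/bin'

# Estimate the 5-gram ARPA model. --prune 0 0 1 drops 3-grams (and, I
# believe, higher orders) that occur only once.
subprocess.run([
    f'{KENLM}/lmplz', '--order', '5', '--memory', '90%',
    '--prune', '0', '0', '1', '--discount_fallback',
    '--text', '/mnt/deepspeechdata/simple/lm/lower.txt.gz',
    '--arpa', '/mnt/deepspeechdata/simple/lm/lm.arpa',
], check=True)

# Compress the ARPA file into the binary trie format.
subprocess.run([
    f'{KENLM}/build_binary', '-a', '255', '-q', '8', 'trie',
    '/mnt/deepspeechdata/simple/lm/lm.arpa',
    '/mnt/deepspeechdata/simple/lm/lm.binary',
], check=True)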

python3 ./data/lm/generate_package.py \
  --alphabet /mnt/deepspeechdata/simple/CV/zh-HK/alphabet.txt \
  --lm /mnt/deepspeechdata/simple/lm/lm.binary \
  --vocab /mnt/deepspeechdata/simple/lm/vocab-10000.txt \
  --package /mnt/deepspeechdata/simple/lm/kenlm.scorer \
  --force_utf8 True \
  --default_alpha 0.931289039105002 \
  --default_beta 1.1834137581510284

I used these two commands to generate the lm.binary and kenlm.scorer files.
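
Once the scorer is built, this is roughly how it gets attached at inference time with the deepspeech Python package (a sketch; the model path is a placeholder):

import deepspeech

model = deepspeech.Model('/mnt/deepspeechdata/simple/output_graph.pbmm')  # placeholder
model.enableExternalScorer('/mnt/deepspeechdata/simple/lm/kenlm.scorer')

# alpha/beta can also be overridden at runtime instead of rebuilding the scorer.
model.setScorerAlphaBeta(0.931289039105002, 1.1834137581510284)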

If you’re using a vocabulary as input to the LM then yes, it could be related to lack of data. You should try with a better text source.
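
To illustrate: a vocabulary file lists each word once, so every word gets the same count and therefore nearly the same probability, whereas running text produces the frequency differences the model needs. A toy comparison (made-up strings, just to show the counts):

from collections import Counter

vocabulary = '記\n得\n揳\n返'.split()       # one word per line: every count is 1
sentences = '唔 記 得 記 得 揳 返'.split()  # running text: counts differ

print(Counter(vocabulary))  # Counter({'記': 1, '得': 1, '揳': 1, '返': 1})
print(Counter(sentences))   # Counter({'記': 2, '得': 2, '唔': 1, '揳': 1, '返': 1})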

Thank you for your reply. The Common Voice dataset for Cantonese is still quite small.

I am doing a proof of concept for Cantonese speech recognition. Only after it is complete will our project team purchase datasets from a data vendor.

What should I do to train on one or two sentences with good results and integrate the model into our website? Is it recommended to do the same as “./bin/run-ldc93s1.sh”, which trains on only one sentence?
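
run-ldc93s1.sh is only a smoke test: it overfits a single WAV file, so the resulting model recognises that one sentence and little else, but it is a reasonable way to prove the pipeline end to end. DeepSpeech training data is just a CSV pointing at WAV files; a sketch of building one for a couple of recordings (the clip paths are placeholders):

import csv, os

clips = [
    ('/mnt/deepspeechdata/simple/wav/sentence1.wav', '唔 記 得'),
    ('/mnt/deepspeechdata/simple/wav/sentence2.wav', '等 緊 小 巴'),
]

with open('train.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['wav_filename', 'wav_filesize', 'transcript'])
    for path, transcript in clips:
        writer.writerow([path, os.path.getsize(path), transcript])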

And I am integrating the "web_microphone_websocket" example into my prototype. When I call finishStream(), an exception occurs. Are there any settings that differ between the pre-trained model and my own model?
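
For the streaming question: the expected call order is createStream, then feedAudioContent repeatedly, then finishStream, and the audio must be 16-bit mono PCM at the model's sample rate (16 kHz for the pre-trained model). Below is the Python equivalent of that flow (a sketch, not the web_microphone_websocket JavaScript itself; paths are placeholders). If only your own model throws, one thing worth checking is that the alphabet used to build the scorer matches the one the model was trained with.

import wave
import numpy as np
import deepspeech

model = deepspeech.Model('output_graph.pbmm')            # placeholder
model.enableExternalScorer('kenlm.scorer')

with wave.open('test.wav', 'rb') as w:                   # 16-bit mono PCM clip
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

stream = model.createStream()
stream.feedAudioContent(audio)        # can be called repeatedly with chunks
print(stream.finishStream())          # returns the final transcript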