Hi, I am experimenting with a small custom LM whose vocabulary consists mostly of digit combinations (all digit combinations should be recognized) plus a small set of non-digit sentences (e.g. “all is good”). The two types never occur together in a sentence. The LM binary and trie generated from this vocabulary work fine for the non-digit sentences with the default tflite model provided for v0.5.1. For digit combinations, however, I observed that sequences that occur in the vocabulary are recognized with higher probability than digit sentences that are not in the vocabulary (e.g. “five seven five nine”). Am I missing something here?
I am sharing the ARPA file and the corresponding LM binary and trie files, followed by the commands I used to generate them.
all_combinations.zip (93.8 KB)
~/terminal/kenlm/build/bin/lmplz --text vocabulary.txt --arpa words.arpa --order 5 --discount_fallback --temp_prefix /tmp/
~/terminal/kenlm/build/bin/build_binary -T -s trie words.arpa lm.binary
~/terminal/repository/DeepSpeech/generate_trie alphabet.txt lm.binary trie
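To make the setup easier to reproduce, here is a rough sketch of how the digit part of a file like vocabulary.txt could be generated so that every combination up to a fixed length appears as its own sentence. This is only an illustration under assumptions, not necessarily how the attached files were built: the word list (with "zero" rather than "oh"), the maximum length of 4, and the helper names digit_sentences and write_vocabulary are all hypothetical.

```python
from itertools import product

# Assumed spoken forms of the digits; adjust if "oh" should be used for 0.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def digit_sentences(max_len):
    """Yield every spoken-digit sequence of length 1..max_len as one sentence."""
    for length in range(1, max_len + 1):
        for combo in product(DIGIT_WORDS, repeat=length):
            yield " ".join(combo)


def write_vocabulary(path, max_len, extra_sentences=()):
    """Write all digit sentences plus any non-digit sentences, one per line.

    Returns the number of lines written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for sentence in digit_sentences(max_len):
            f.write(sentence + "\n")
            count += 1
        for sentence in extra_sentences:
            f.write(sentence + "\n")
            count += 1
    return count
```

Calling write_vocabulary("vocabulary.txt", 4, ["all is good"]) would produce a file that contains “five seven five nine” explicitly. The number of lines grows as 10 + 10^2 + ... + 10^max_len, so this is only practical for short sequences.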