Sure, I’ll try to explain my thought process as I was reading the page. The context I was reading it in is the use case above: matching specific sequences of words (commands).
Section: Default mode (alphabet based)
Word based means the text corpus used to build the scorer should contain words separated by whitespace
Here I was thinking that in my alphabet, whitespace (specifically spaces) is relevant, so I don’t want to use it as a separator.
Section: UTF-8 mode
UTF-8 scorers are character based (more specifically, Unicode codepoint based), but the way they are used is similar to a word based scorer where each “word” is a sequence of UTF-8 bytes representing a single Unicode codepoint
Specifically, I am trying to match words, so my thinking was that my alphabet consists of the words that appear in the specific phrases I want to match, e.g.
From the language modeling perspective, this is a character based model. From the implementation perspective, this is a word based model, because each character is composed of multiple labels.
Again, this reinforced my belief that this was the correct approach.
Because KenLM uses spaces as a word separator, the resulting language model will not include space characters in it. If you wish to use UTF-8 mode but still model spaces, you need to replace spaces in the input corpus with a different character before converting it to space separated codepoints…
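If I read that last passage correctly, the preprocessing it describes could look roughly like the sketch below: swap each space for a placeholder, then put a space between every codepoint so that KenLM sees one codepoint per "word". The function name `to_codepoint_tokens` and the `|` placeholder are my own assumptions for illustration; they are not part of DeepSpeech or KenLM, and I have not checked how the scorer tooling expects the placeholder to be mapped back.

```python
# Minimal sketch of the corpus preprocessing described in the quoted passage.
# Everything here is hypothetical helper code, not a DeepSpeech or KenLM API.

SPACE_PLACEHOLDER = "|"  # assumption: a character that never occurs in the corpus


def to_codepoint_tokens(line: str, placeholder: str = SPACE_PLACEHOLDER) -> str:
    """Replace spaces with a placeholder, then separate codepoints with spaces."""
    line = line.strip()
    if placeholder in line:
        # The placeholder must be unambiguous, otherwise real text would be
        # confused with word boundaries later on.
        raise ValueError(f"placeholder {placeholder!r} already occurs in: {line!r}")
    # Iterating over a Python str yields one Unicode codepoint at a time.
    return " ".join(line.replace(" ", placeholder))


commands = ["open the door", "close the door", "answer phone", "make coffee"]
corpus_lines = [to_codepoint_tokens(command) for command in commands]

for corpus_line in corpus_lines:
    print(corpus_line)
```

With this, "open the door" becomes `o p e n | t h e | d o o r`, so the space survives as its own token instead of being consumed as a separator.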
Hope this helps. Given that I have seen command matching in this forum a few times but couldn’t find an answer for this, could you please clarify the following?
Is it possible to build a scorer that would accept the following phrases:
- open the door
- close the door
- answer phone
- make coffee
But that would not match the phrase: