Hi, I have a question about the vocabulary. I want to create a scorer and need to specify some valid sentences. Some sentences for my model would be:
“add xxxxxx to telephone book”
x can be any digit between “zero” and “nine” as written out words.
The problem is that I don’t want to add every possible combination of digits to the vocabulary by hand. So I was wondering whether it is possible to specify a selection of possible words at any position in the sentence, like:
add [zero|one|two|three|four|five|six|seven|eight|nine] to phone book
Thanks!
lissyx
This is not something KenLM supports; it more or less expects full sentences. In my experiments, listing out all the variants you need works quite well, so maybe you should just generate your vocab.txt with a script: write your sentences using alternatives as you suggested, then expand them into full sentences.
I’m not sure this is something we could handle in generate_lm.py; maybe @reuben could share his opinion on that?
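Generating vocab.txt this way could look roughly like the sketch below. It is only an illustration: the `[a|b|c]` bracket syntax is the notation from the question, not something KenLM or generate_lm.py understands, and the names `expand`, `write_vocab`, `GROUP` and `DIGITS` are made up for this example.

```python
import itertools
import re

# Matches one bracketed group of alternatives, e.g. "[zero|one|two]".
# This bracket syntax is an assumption taken from the question above.
GROUP = re.compile(r"\[([^\[\]]+)\]")

DIGITS = "[zero|one|two|three|four|five|six|seven|eight|nine]"


def expand(template):
    """Yield every sentence obtained by picking one alternative per group."""
    # Splitting on a pattern with one capture group puts literal text at
    # even indices and the contents of each group at odd indices.
    parts = GROUP.split(template)
    choices = [
        part.split("|") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    for combo in itertools.product(*choices):
        # Join the pieces and collapse any repeated whitespace.
        yield " ".join("".join(combo).split())


def write_vocab(path, templates):
    """Write the expansion of each template to a text file, one sentence per line."""
    with open(path, "w", encoding="utf-8") as out:
        for template in templates:
            for sentence in expand(template):
                out.write(sentence + "\n")


if __name__ == "__main__":
    # Small demo with two alternatives so the output stays readable.
    for sentence in expand("add [zero|one] to phone book"):
        print(sentence)
    # prints:
    # add zero to phone book
    # add one to phone book
```

For the six-digit case, something like `write_vocab("vocab.txt", ["add " + " ".join([DIGITS] * 6) + " to phone book"])` would write 10^6 lines, one per digit combination, so the file grows by a factor of ten with every additional digit position.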