Vocabulary with 'placeholders' for a selection of words

Hi, I have a question about the vocabulary. I want to create a scorer and need to specify some valid sentences. Some sentences for my model would be:

“add xxxxxx to telephone book”

Each x can be any digit from “zero” to “nine”, written out as a word.
The problem is that I don’t want to add every possible combination of digits to the vocabulary. So I was wondering whether it is possible to specify a selection of possible words at any position in the sentence, like:

add [zero|one|two|three|four|five|six|seven|eight|nine] to phone book

Thanks!

KenLM does not support this; it more or less expects full strings. In experiments, listing multiple variants the way you need works quite well, so you could generate your vocab.txt with a script: write your sentences with alternatives as you suggested, and expand them into full sentences.
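As a rough illustration of generating the vocabulary, here is a minimal Python sketch that expands the bracket notation from the question into full sentences. The `[a|b|c]` syntax is only the notation proposed above; neither KenLM nor generate_lm.py understands it. The names `expand` and `write_vocab` are made up for this example.

```python
import itertools
import re

# Matches one bracketed group of alternatives, e.g. "[zero|one|two]".
# This notation comes from the question; it is not a KenLM feature.
ALTERNATIVES = re.compile(r"\[([^\]]+)\]")


def expand(template):
    """Yield every sentence obtained by picking one option per [a|b|c] group."""
    # re.split with a single capture group returns literal text at even
    # indices and the captured group bodies at odd indices.
    parts = ALTERNATIVES.split(template)
    choices = [part.split("|") if i % 2 else [part] for i, part in enumerate(parts)]
    for combo in itertools.product(*choices):
        # Join the chosen pieces and normalize whitespace.
        yield " ".join("".join(combo).split())


def write_vocab(templates, path="vocab.txt"):
    """Write one expanded sentence per line; return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for template in templates:
            for sentence in expand(template):
                handle.write(sentence + "\n")
                count += 1
    return count
```

For the six-digit case you would repeat the digit group six times in the template, e.g. `"add " + " ".join([digits] * 6) + " to phone book"`, and pass it to `write_vocab`. Note that this produces 10^6 lines for that one template, so whether the resulting file size is acceptable is something to check for your setup.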

I’m not sure this is something we could handle in generate_lm.py. Maybe @reuben could share his opinion on that?


@lissyx
Thanks for the clarification. I have two further questions:

  1. Would it be okay to exclude some phrases from the vocabulary like those mentioned?
  2. Is it okay if I add the numbers in words (zero to nine) as single words and not in sentences?

You really have to run your own experiments; each use case is a bit different.
