I think what @mischmerz is trying to do is to restrict inference with the released model to only a few words, hoping there is a way to extract just the parts of the acoustic/language models that are relevant to those words (perhaps because not enough audio samples are available for acoustic model training?).
AFAIK, the acoustic model cannot be targeted like that, but a new language model containing only the chosen words can be used for the decoding part. This does not require any audio training data, just some representative text containing the words you need. Note that if people say things outside your language model, their speech is still likely to be mapped onto words from the language model.
You can see an example of how to set up a new language model in this topic https://discourse.mozilla.org/t/tutorial-how-i-trained-a-specific-french-model-to-control-my-robot/22830/44 .
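To give a rough idea of what that looks like, here is a minimal sketch (my own simplification, not the exact steps from the tutorial): write out a small corpus of the phrases you expect, then build a KenLM language model from it. The phrases, the file names (vocab.txt, lm.arpa, lm.binary) and the n-gram order are all placeholders you'd adapt to your setup, and it assumes KenLM's lmplz and build_binary tools are installed and on your PATH.

```python
import subprocess

# Hypothetical command phrases -- replace with text that is
# representative of what your users will actually say.
phrases = [
    "turn the light on",
    "turn the light off",
    "stop",
    "go forward",
]

# Write the representative text corpus.
with open("vocab.txt", "w") as f:
    for p in phrases:
        f.write(p + "\n")

# Build a small 3-gram ARPA model with KenLM.
# --discount_fallback helps lmplz cope with very small corpora.
subprocess.run(
    "lmplz --order 3 --discount_fallback < vocab.txt > lm.arpa",
    shell=True, check=True,
)

# Convert it to KenLM's binary format for faster loading.
subprocess.run(["build_binary", "lm.arpa", "lm.binary"], check=True)
```

How you then hand the result to the DeepSpeech decoder (a separate LM binary plus trie, or a packaged scorer) depends on the version you're running; the tutorial linked above walks through that part.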
This method won’t make the acoustic part any faster, but the decoding should be a bit faster.