Limited language model in noisy environment

Hello all,

I would like to ask whether, and why, it makes sense to customize the language model for voice recognition. I want to build a simple application that controls a machine using a small “context” of commands (about 100 words in total, arranged in sentences with a fixed structure such as "Hey machine, {set|get} the {voltage|air pressure|whatever} to ") and to evaluate its performance in a noisy environment. Everything should run on a Raspberry Pi 4, so the hardware demands need to stay reasonable.

First, I would like to double-check one of my assumptions: limiting the vocabulary and language model should also improve performance in a noisy environment, correct? Since I’m reducing the set of words/N-grams that can occur, I’m increasing the probability that the correct word combination is recognized, even in a noisier environment?

In general, my questions are:

  1. Is there a common way to create such a custom, limited vocabulary and language model while using pretrained models for the other components? I.e., I wouldn’t need, for example, a set of audio recordings?
  2. Is there a way to set the system up as follows? Say I have the “set” and “get” commands for different parameters and I would like to define the probabilities as {0.1, 0.1, 0.8} for {“set”, “get”, }? Putting the there to avoid false detection of “set” or “get” when a completely different word was spoken? Or is there a better way to handle this?
  3. In my scenario, what else could I tune to improve noise resistance?

Thank you in advance for your help


Correct, for the language model you just need text, no audio.
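
To make "just text" concrete, here is a minimal sketch that expands the command template from the question into one sentence per line. The verbs and parameter names come from the question; the value words, the lowercasing, and the function name `expand_commands` are placeholders of mine, not anything the toolkit prescribes. Plain text in this shape is what n-gram LM builders such as KenLM typically take as input, but check what your own tooling expects.

```python
from itertools import product

VERBS = ["set", "get"]                 # from the question
PARAMS = ["voltage", "air pressure"]   # from the question
VALUES = ["one", "two", "three"]       # hypothetical placeholder values


def expand_commands(verbs, params, values):
    """Return every sentence the template can produce (lowercase, no punctuation)."""
    return [
        f"hey machine {verb} the {param} to {value}"
        for verb, param, value in product(verbs, params, values)
    ]


corpus = expand_commands(VERBS, PARAMS, VALUES)

# The vocabulary is simply every distinct word that appears in the corpus.
vocabulary = sorted({word for sentence in corpus for word in sentence.split()})

# One sentence per line; save this string to a text file to feed an LM builder.
corpus_text = "\n".join(corpus) + "\n"

print(len(corpus), "sentences,", len(vocabulary), "distinct words")
```

With a real command set you would replace the three lists with your own words; the corpus stays small enough to regenerate on every change.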

You’d have to build a text corpus that matches these statistics. The easiest way to start is simply to collect a corpus of real commands, build an LM from it, and see how it performs. You can then make adjustments. Building LMs is a quick process, so you can experiment easily.
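
If you don't have real commands yet, one way to get a corpus that roughly matches target statistics is to sample sentences with the desired weights. This is only a sketch under assumptions: the 0.1/0.1/0.8 split is taken from the question, I am reading the third category as "anything else", and the filler phrases and function names are invented for illustration. An LM built from such text will only approximate these shares once smoothing is applied.

```python
import random
from collections import Counter

PARAMS = ["voltage", "air pressure"]  # from the question
# Hypothetical stand-ins for "everything that is not a command".
FILLER = ["what time is it", "please open the door", "thank you very much"]
KINDS = ["set", "get", "other"]
WEIGHTS = [0.1, 0.1, 0.8]             # target shares from the question


def sample_sentence(rng):
    """Draw one (kind, sentence) pair according to WEIGHTS."""
    kind = rng.choices(KINDS, weights=WEIGHTS)[0]
    if kind == "other":
        return kind, rng.choice(FILLER)
    return kind, f"hey machine {kind} the {rng.choice(PARAMS)}"


def build_corpus(n, seed=0):
    """Sample n sentences reproducibly."""
    rng = random.Random(seed)
    return [sample_sentence(rng) for _ in range(n)]


corpus = build_corpus(10_000)
counts = Counter(kind for kind, _ in corpus)
shares = {kind: counts[kind] / len(corpus) for kind in KINDS}
print(shares)  # empirical shares, close to the target weights
```

Printing the empirical shares is a cheap sanity check before building the LM; changing `WEIGHTS` and rebuilding is how you would make the adjustments mentioned above.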

You could try fine-tuning the model on audio that matches your use case more closely, if you have access to that, although it’s certainly a more difficult process than creating language models.
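
If matching recordings are hard to come by, a related technique (not the same as recording in the real environment) is to mix recorded machine noise into clean speech at a chosen signal-to-noise ratio. The sketch below is an assumption-laden illustration in plain Python: samples are lists of floats, and `rms` and `mix_at_snr` are hypothetical helper names, not part of any speech toolkit.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the result has the requested SNR in dB."""
    # Loop the noise so it covers the whole utterance.
    looped = [noise[i % len(noise)] for i in range(len(speech))]
    noise_rms = rms(looped)
    if noise_rms == 0:
        raise ValueError("noise signal is silent")
    # SNR(dB) = 20 * log10(speech_rms / noise_rms), solved for the noise gain.
    gain = rms(speech) / (10 ** (snr_db / 20)) / noise_rms
    return [s + gain * n for s, n in zip(speech, looped)]
```

Generating the same utterances at several SNR values also gives you a repeatable way to evaluate how recognition degrades as the noise level rises.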

Thank you very much for your prompt replies, Reuben!