Language models for dictation versus transcription tasks

I am creating a dictation service for Linux desktops. While general text recognition is quite good and I can type into almost every application, the language model is extremely bad at picking up the very common punctuation and capitalization commands that a user dictates in order to produce clean output.

That is, when the user wants to type a capitalized word, they say something along the lines of caps ${word} or title ${word} or upper ${word}, and the language model should match the meta-command with a much higher priority than typing out the literal word “caps”.
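
The formatting side of those commands is the easy part; the problem is getting the recognizer to emit the command tokens at all. For concreteness, here is a minimal sketch of the post-processing step (the exact semantics of caps/title/upper are just illustrative):

```python
# Illustrative only: formatting pass over a transcript that already contains
# the command tokens. The real difficulty is getting the recognizer to emit
# "caps"/"title"/"upper" as commands rather than literal words.

def apply_format_commands(tokens):
    """Collapse 'caps word' / 'title word' / 'upper word' into formatted words."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("caps", "title") and i + 1 < len(tokens):
            out.append(tokens[i + 1].capitalize())
            i += 2
        elif tok == "upper" and i + 1 < len(tokens):
            out.append(tokens[i + 1].upper())
            i += 2
        else:
            out.append(tok)
            i += 1
    return " ".join(out)

# "caps hello world upper gpu" -> "Hello world GPU"
print(apply_format_commands("caps hello world upper gpu".split()))
```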

In a previous system I attempted to address this by producing a corpus: inferring from the final formatting what the user likely would have said in order to produce that text. However, that seems like it would mean an enormous duplication of effort with the DeepSpeech project, which is already processing a very large corpus and producing a reasonable language model from it.
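
For what it’s worth, the reverse-formatting step I used looked roughly like this (a sketch, not the actual code): take finished text and emit the spoken form a user would plausibly have dictated, command words included.

```python
import re

# Sketch of the "infer what the user said" corpus generator: turn finished
# text into a plausible spoken form, with explicit command words, so the
# language model sees "caps"/"period"/etc. in realistic contexts.

PUNCT_WORDS = {".": "period", ",": "comma", "?": "question mark", "!": "exclamation mark"}

def spoken_form(text):
    tokens = re.findall(r"[A-Za-z']+|[.,?!]", text)
    spoken = []
    for tok in tokens:
        if tok in PUNCT_WORDS:
            spoken.append(PUNCT_WORDS[tok])
        elif tok.isupper() and len(tok) > 1:
            spoken.append("upper " + tok.lower())
        elif tok[0].isupper():
            spoken.append("caps " + tok.lower())
        else:
            spoken.append(tok)
    return " ".join(spoken)

# "Hello, world. Use the GPU." ->
# "caps hello comma world period caps use the upper gpu period"
print(spoken_form("Hello, world. Use the GPU."))
```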

My current plan is to gang together a bunch of language models so that a given utterance is processed through a cascading series of language models and rules. That is, a very high-priority language model might recognize only punctuation and capitalization commands and, after processing the phonemes, pass any unallocated phonemes on to the next model.
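
To make the cascade concrete, this is roughly the shape I have in mind. The interfaces below are entirely hypothetical (DeepSpeech does not expose a phoneme-level hand-off like this), which is part of why it feels fragile:

```python
# Hypothetical sketch of the cascade: each stage claims the spans it is
# confident about (commands first) and hands unclaimed audio to the next,
# more general stage. None of these interfaces exist in DeepSpeech.

class Stage:
    def __init__(self, name, decode):
        self.name = name
        self.decode = decode  # audio_span -> (text, claimed: bool)

def run_cascade(stages, spans):
    """Decode each audio span with the first stage willing to claim it."""
    results = []
    for span in spans:
        for stage in stages:
            text, claimed = stage.decode(span)
            if claimed:
                results.append((stage.name, text))
                break
        else:
            results.append(("unrecognized", ""))
    return results
```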

That all seems rather complex and fragile, however. Is there a more obvious solution? For instance, would it be possible to modify the language model at runtime in order to introduce extremely high-frequency or high-likelihood words?
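
One way I could imagine approximating that without touching the model itself is rescoring the decoder’s n-best hypotheses with a bonus for the command words. A toy sketch, assuming the decoder can return several scored candidates (all the numbers here are made up):

```python
# Sketch of n-best rescoring as a stand-in for runtime LM modification:
# candidates are (transcript, score) pairs in the log domain; boosted words
# add a bonus per occurrence. Boost values are arbitrary placeholders.

BOOSTS = {"caps": 4.0, "upper": 4.0, "period": 3.0, "comma": 3.0}

def rescore(candidates, boosts=BOOSTS):
    def boosted(item):
        text, score = item
        return score + sum(boosts.get(w, 0.0) for w in text.split())
    return max(candidates, key=boosted)

candidates = [
    ("caps hello world", -12.0),  # command interpretation
    ("cap hello world", -11.5),   # literal mis-hearing
]
print(rescore(candidates))  # ('caps hello world', -12.0) wins after the boost
```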

As my eventual goal is to allow for coding with the voice dictation service, there will likely be many different language models (a few base models for common situations and then at least one for each programming language). It will also be necessary for users to define their own words through the interface, of course.

On top of that, I would like to allow for something like “momentary contexts” that take, for instance, the auto-completion results from an IDE and use them to create a very high-priority language model overlay.
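
As a sketch of what I mean by a momentary context: take whatever the IDE is currently offering as completions and turn it into a short-lived set of boosted vocabulary, discarded as soon as the completion popup closes. The splitting rules and boost value here are purely illustrative:

```python
import re

# Illustrative: turn IDE completion results into a short-lived overlay of
# boosted spoken tokens (split camelCase/snake_case into dictatable words).

def spoken_tokens(identifier):
    parts = re.split(r"[_\W]+", identifier)
    words = []
    for part in parts:
        words.extend(w.lower() for w in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part))
    return [w for w in words if w]

def build_overlay(completions, boost=5.0):
    """Map every dictatable word in the completion list to a boost weight."""
    overlay = {}
    for identifier in completions:
        for word in spoken_tokens(identifier):
            overlay[word] = boost
    return overlay

# e.g. from an IDE offering ["read_file", "readLines", "REFRESH_RATE"]
print(build_overlay(["read_file", "readLines", "REFRESH_RATE"]))
```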

Again, I am really asking whether there is a more obvious way to do all this.

I don’t follow your logic. Producing a good corpus that captures the mode switches is your best bet if you’re looking to avoid complexity and fragility. There’s no duplication of effort because the DeepSpeech project is building a speech recognition engine; we don’t know your use case or have your data. Layering tons of heuristics with multiple language models and dynamically boosting the probability of certain terms at runtime is going to be very ugly code.
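
To be concrete: once you have spoken-form sentences with the command words in them, building a custom language model from that corpus is just the usual KenLM flow. A sketch, assuming the KenLM binaries lmplz and build_binary are on your PATH (file names and n-gram order are placeholders, and you still need to package the result for the decoder):

```python
import subprocess

# Illustrative: build a custom KenLM language model from a spoken-form
# corpus (one sentence per line, command words included).

def build_lm(corpus_path="spoken_corpus.txt",
             arpa_path="lm.arpa",
             binary_path="lm.binary",
             order=4):
    # Estimate an n-gram model from the corpus (lmplz reads stdin, writes ARPA to stdout).
    with open(corpus_path, "rb") as corpus, open(arpa_path, "wb") as arpa:
        subprocess.run(["lmplz", "-o", str(order)], stdin=corpus, stdout=arpa, check=True)
    # Convert the ARPA file to KenLM's binary format for fast loading.
    subprocess.run(["build_binary", arpa_path, binary_path], check=True)

if __name__ == "__main__":
    build_lm()
```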

However, there is one case where you can’t really avoid the dynamic nature of things, which is contextual biasing, since you only know the data at runtime. We don’t currently have a very good solution for this, especially for mixed commands and dictation. If you’d like to take a crack at contributing such a feature to DeepSpeech, I can point you in the right direction.