Adding new sentences to the existing language model

I want to train DeepSpeech on domain-specific data, and to get good results at inference time I want to add new sentences to the existing language model.

According to the documentation, the LibriSpeech normalized LM training text was used to create the current language model. After downloading librispeech-lm-norm.txt, I saw that the file contains lines like

...
A A A A A BOVE SECOND SINGER DIMINUENDO
A A A A A MEN
A A A A A Y
A A A A AHOWOOH
A A A A ALL ABOARD
...

These lines do not make any sense. Can anyone please help me understand the format of this data?

If I want to create a custom domain-specific language model, or add sentences to the existing one, can I add the sentences from my data directly, or do I have to convert my data into some other format (like the one shown above) before building a language model from it?

You can add them directly. I’m not sure I understand why you think there is a specific format; how we rebuild the language model is properly documented in data/lm/.
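For instance, "adding them directly" can be as simple as appending your own lines to a copy of the training corpus before rebuilding. A minimal sketch (the file names here are placeholders, not the actual paths used by the build scripts):

```python
# Sketch: append domain-specific sentences to a copy of the LM training corpus.
# File names are placeholders; adjust them to your own setup.
with open("librispeech-lm-norm.txt", encoding="utf-8") as base, \
     open("my_domain_sentences.txt", encoding="utf-8") as extra, \
     open("combined-corpus.txt", "w", encoding="utf-8") as out:
    for line in base:
        out.write(line)
    for line in extra:
        # KenLM expects one sentence per line; no other formatting is required.
        out.write(line if line.endswith("\n") else line + "\n")
```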

@lissyx, the reason I thought there was a specific format is the sentences present in LibriSpeech’s normalized LM training text (a few examples are shown above).

If you read the corpus used to build the language model (i.e. the LibriSpeech LM corpus specified in data/lm/README), you’ll see that many of the lines there are not proper English sentences. That made me wonder why the data is in such a format.

This is just how they built their corpus, but KenLM does not require any specific format, so you can use that text as is.
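As a rough illustration of what "as is" means in practice, a typical KenLM build over a plain one-sentence-per-line text file looks something like the sketch below. This is not necessarily the exact procedure documented in data/lm/; the n-gram order and file names are assumptions to adapt to your setup.

```python
import subprocess

# Sketch: build a KenLM language model from a plain text corpus
# (one sentence per line). Paths and the n-gram order are assumptions.
corpus = "combined-corpus.txt"

# Estimate a 5-gram ARPA model with KenLM's lmplz.
with open(corpus, "rb") as text, open("lm.arpa", "wb") as arpa:
    subprocess.run(["lmplz", "-o", "5"], stdin=text, stdout=arpa, check=True)

# Convert the ARPA file to KenLM's binary format for faster loading.
subprocess.run(["build_binary", "lm.arpa", "lm.binary"], check=True)
```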

Thanks for the help.

I just have one more query. If we feed in that kind of corpus, where the sentences don’t make sense, wouldn’t that also produce a poor language model?

That also depends on how much of such text there is. The language model built from this corpus has improved quality a lot, but we are still working on improving it.
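One way to sanity-check the model you build (assuming the kenlm Python package is installed; the example sentences below are arbitrary) is to compare the scores it assigns to fluent, in-domain text versus word salad:

```python
import kenlm

# Sketch: load the binary model and compare log10 probabilities.
# A reasonable model should score fluent, in-domain text higher
# (less negative) than nonsense sequences.
model = kenlm.Model("lm.binary")

print(model.score("play the next song", bos=True, eos=True))
print(model.score("a a a a a y", bos=True, eos=True))
```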

Thanks @lissyx for the help.

Is there any advantage to replacing the LibriSpeech normalized text in librispeech-lm-norm.txt with, say, the entire text of Wikipedia or a similarly exhaustive English corpus (given that I want to build an English language model)? Are there any limits on the size of this file or of the language model?