Generate LM

I am trying to generate a custom LM, but my text corpus contains a lot of blank lines and special characters. I know that the generate_lm.py script takes care of uppercase letters, but do the script or the kenlm binaries also take care of special characters and blank lines?

It’s not possible for a script to handle this automatically for any input text, so you’ll have to do some cleanup yourself. If you don’t remove special characters, they’ll be included in the language model, but won’t be added to the trie file as long as the special characters aren’t in your alphabet. So you’ll waste space (the LM will be bigger than it has to be) but it shouldn’t affect things too badly. It can have effects in some cases, for example if the text has a line like this:

“The quick brown fox jumps over the lazy dog”.

It’ll include “The (with the opening quote attached) and dog”. (with the closing quote and period attached) as words in your LM, which is probably not what you’re looking for. Hopefully with enough data these things become irrelevant, but if you don’t have a lot of text data, it might be worth cleaning things up.
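
As a rough illustration of the kind of cleanup meant here, the sketch below lowercases each line, replaces anything outside the alphabet with a space, and drops lines that end up empty. It assumes an alphabet of a-z, apostrophe and space; adjust the pattern to whatever your own alphabet file contains. The names clean_line and clean_lines are made up for this example and are not part of generate_lm.py or kenlm.

```python
import re

# Assumption: the alphabet is a-z, apostrophe and space.
# Change this pattern to match the alphabet file you actually use.
DISALLOWED = re.compile(r"[^a-z' ]+")

# Map the curly apostrophe to the ASCII one so contractions survive.
APOSTROPHE = str.maketrans({"\u2019": "'"})


def clean_line(line):
    """Lowercase a line and replace characters outside the alphabet with spaces."""
    line = line.translate(APOSTROPHE).lower()
    line = DISALLOWED.sub(" ", line)
    # Trim apostrophes left dangling at word edges and collapse whitespace.
    words = (word.strip("'") for word in line.split())
    return " ".join(word for word in words if word)


def clean_lines(lines):
    """Yield cleaned lines, skipping any that are empty after cleaning."""
    for line in lines:
        cleaned = clean_line(line)
        if cleaned:
            yield cleaned
```

With the example line above, clean_line returns "the quick brown fox jumps over the lazy dog", so the LM sees the and dog rather than tokens with punctuation attached.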

@reuben actually I am using the open web text corpus that you have mentioned here:

I tried the LM that you have given here, but it increased the inference time compared with what I am using right now, which is an LM built from another, smaller custom text corpus. So I am trying to reduce the size of the open web text corpus until I reach a balance between accuracy and inference time. But this corpus, as you might know, is split into many text files across multiple folders, and it contains a lot of spaces and other characters.
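
One possible way to get such a corpus into a single, smaller file is sketched below. It assumes the parts are .txt files nested under one root folder, and it simply stops after a given number of lines, which is a crude way to shrink the corpus rather than a recommended one. The function name merge_corpus and its parameters are hypothetical.

```python
import os


def merge_corpus(root_dir, out_path, max_lines=None):
    """Concatenate every .txt file under root_dir into out_path.

    Blank lines are skipped and runs of whitespace are collapsed.
    If max_lines is set, stop once that many lines have been written.
    Returns the number of lines written.
    """
    out_abs = os.path.abspath(out_path)
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for dirpath, dirnames, filenames in os.walk(root_dir):
            dirnames.sort()  # visit subfolders in a deterministic order
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                # Only read .txt files, and never read the output file itself.
                if not name.endswith(".txt") or os.path.abspath(path) == out_abs:
                    continue
                with open(path, encoding="utf-8", errors="ignore") as src:
                    for line in src:
                        line = " ".join(line.split())
                        if not line:
                            continue
                        if max_lines is not None and written >= max_lines:
                            return written
                        out.write(line + "\n")
                        written += 1
    return written
```

Calling merge_corpus with different max_lines values gives corpora of different sizes, so you can build an LM from each and compare accuracy against inference time.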

This might be useful: https://kheafield.com/code/kenlm/filter/