Generate LM

rajpuneet.sandhu · November 12, 2019, 8:24pm

I am trying to generate a custom LM, but my text corpus contains a lot of blank lines and special characters. I know that the generate_lm.py script takes care of uppercase letters, but does the script or the KenLM binaries take care of special characters and blank lines?

reuben · November 12, 2019, 8:34pm

It’s not possible for a script to handle this automatically for any input text, so you’ll have to do some cleanup yourself. If you don’t remove special characters, they’ll be included in the language model, but they won’t be added to the trie file as long as they aren’t in your alphabet. So you’ll waste space (the LM will be bigger than it has to be), but it shouldn’t affect things too badly. It can have effects in some cases, for example if the text has a line like this:

“The quick brown fox jumps over the lazy dog”.

It’ll include “The (with the opening quote) and dog”. (with the closing quote and period) as words in your LM, which is probably not what you’re looking for. Hopefully with enough data these things become irrelevant, but if you don’t have a lot of text data, it might be worth cleaning things up.
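A minimal sketch of that kind of cleanup, in Python. It assumes an alphabet of lowercase a-z, apostrophe, and space; that set, and the helper names `clean_line` and `clean_lines`, are only illustrative, so adjust them to whatever your own alphabet file contains.

```python
import re

# Assumption: the alphabet is lowercase a-z, apostrophe, and space.
# Change this pattern to match your own alphabet file.
_NOT_IN_ALPHABET = re.compile(r"[^a-z' ]+")
_WHITESPACE = re.compile(r"\s+")


def clean_line(line):
    """Lowercase a line, blank out characters outside the alphabet,
    and collapse runs of whitespace into single spaces."""
    line = line.lower().replace("’", "'")  # normalize curly apostrophes
    line = _NOT_IN_ALPHABET.sub(" ", line)
    return _WHITESPACE.sub(" ", line).strip()


def clean_lines(lines):
    """Yield cleaned lines, skipping any that end up empty."""
    for line in lines:
        cleaned = clean_line(line)
        if cleaned:
            yield cleaned
```

With this, the example line above comes out as plain words with no quote or period attached, and blank lines are dropped before the text reaches KenLM.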

rajpuneet.sandhu · November 12, 2019, 8:46pm

@reuben actually I am using the open web text corpus that you have mentioned here:

I tried the LM that you provided here, but it increased the inference time compared with the LM I am using right now, which is built from a smaller custom text corpus. So I am trying to reduce the size of the open web text corpus to find a balance between accuracy and inference time. But this corpus, as you might know, is split into many text files across multiple folders, and it contains a lot of spaces and other characters.

reuben · November 12, 2019, 9:03pm

This might be useful: https://kheafield.com/code/kenlm/filter/
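For the multi-folder layout, here is a rough Python sketch that walks a directory tree and merges every text file into one file, dropping blank lines and collapsing extra spaces. The function names, the `.txt` extension check, and the `max_lines` cap are assumptions for illustration; the cap is a crude way to shrink the corpus, not a recommendation for how to sample it.

```python
import os


def iter_corpus_lines(root):
    """Yield non-blank lines, with whitespace collapsed, from every
    .txt file found under root (assumes the parts end in .txt)."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subfolders in a stable order
        for name in sorted(filenames):
            if not name.endswith(".txt"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as handle:
                for line in handle:
                    line = " ".join(line.split())
                    if line:
                        yield line


def merge_corpus(root, out_path, max_lines=None):
    """Write the merged lines to out_path and return how many were written.
    Keep out_path outside root so the output is not read back in."""
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for line in iter_corpus_lines(root):
            if max_lines is not None and written >= max_lines:
                break
            out.write(line + "\n")
            written += 1
    return written
```

Building the LM from merged files of different sizes is one way to see where accuracy and inference time balance out for your data.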
