Adding new sentences to the existing language model

I want to train DeepSpeech on domain-specific data, and to get good results at inference time I want to add new sentences to the existing language model.

According to the documentation, the LibriSpeech normalized LM training text was used to create the current language model. After downloading librispeech-lm-norm.txt, I saw that the file contains lines like

...
A A A A A BOVE SECOND SINGER DIMINUENDO
A A A A A MEN
A A A A A Y
A A A A AHOWOOH
A A A A ALL ABOARD
...

These lines do not make any sense. Can anyone please help me understand the format of this data?

If I want to create a custom domain-specific language model, or add sentences to the existing one, can I add the sentences from my data directly, or do I have to convert my data into some other format (like the one shown above) before building a language model from it?

You can add them directly. I’m not sure I understand why you think there is a specific format; how we rebuild the language model is properly documented in data/lm/.
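For instance, "adding them directly" can be as simple as appending your own lines to a copy of the training corpus before rebuilding. A minimal sketch (the file names here are placeholders, not the actual paths used by the build scripts):

```python
# Sketch: append domain-specific sentences to a copy of the LM training corpus.
# File names are placeholders; adjust them to your own setup.
with open("librispeech-lm-norm.txt", encoding="utf-8") as base, \
     open("my_domain_sentences.txt", encoding="utf-8") as extra, \
     open("combined-corpus.txt", "w", encoding="utf-8") as out:
    for line in base:
        out.write(line)
    for line in extra:
        # KenLM expects one sentence per line; no other formatting is required.
        out.write(line if line.endswith("\n") else line + "\n")
```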

@lissyx, the reason I thought there was a specific format is the sentences present in LibriSpeech’s normalized LM training text (a few examples are shown above).

If you read the corpus used to build the language model (i.e. the LibriSpeech LM corpus specified in data/lm/README), you’ll see that many of the lines there are not proper English sentences. That made me wonder why the data is in such a format.

This is just how they built their corpus, but KenLM does not require any specific format, so you can use that text as is.
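As a rough illustration of what "as is" means in practice, a typical KenLM build over a plain one-sentence-per-line text file looks something like the sketch below. This is not necessarily the exact procedure documented in data/lm/; the n-gram order and file names are assumptions to adapt to your setup.

```python
import subprocess

# Sketch: build a KenLM language model from a plain text corpus
# (one sentence per line). Paths and the n-gram order are assumptions.
corpus = "combined-corpus.txt"

# Estimate a 5-gram ARPA model with KenLM's lmplz.
with open(corpus, "rb") as text, open("lm.arpa", "wb") as arpa:
    subprocess.run(["lmplz", "-o", "5"], stdin=text, stdout=arpa, check=True)

# Convert the ARPA file to KenLM's binary format for faster loading.
subprocess.run(["build_binary", "lm.arpa", "lm.binary"], check=True)
```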

Thanks for the help.

I just have one more query. If we feed in that kind of corpus, where the sentences don’t make sense, wouldn’t that also produce a poor language model?

That also depends on how much of such text there is. The language model built from this corpus has improved quality a lot, but we are still working on improving it.
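One way to sanity-check the model you build (assuming the kenlm Python package is installed; the example sentences below are arbitrary) is to compare the scores it assigns to fluent, in-domain text versus word salad:

```python
import kenlm

# Sketch: load the binary model and compare log10 probabilities.
# A reasonable model should score fluent, in-domain text higher
# (less negative) than nonsense sequences.
model = kenlm.Model("lm.binary")

print(model.score("play the next song", bos=True, eos=True))
print(model.score("a a a a a y", bos=True, eos=True))
```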

Thanks @lissyx for the help.

Is there any advantage to replacing the LibriSpeech normalized text in librispeech-lm-norm.txt with, say, the entire text of Wikipedia or a similarly exhaustive English corpus (given that I want to build an English language model)? Are there any limits on the size of this file or of the language model?