Building LM, noticed vocab.txt and librispeech-lm-norm.txt have a lot of low-quality words

Thanks for all the work going into this amazing project! :slight_smile:

I’ve been looking at building a custom LM, following https://github.com/mozilla/DeepSpeech/tree/master/data/lm, and have managed fairly well so far using my own data (i.e. as the equivalent of the text input to lmplz and as the vocab.txt for generate_trie).
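For anyone following the same path, the preprocessing I mean can be sketched roughly like this. The file names, and the normalization rule (lowercase, letters and apostrophes only, to roughly match the style of librispeech-lm-norm.txt), are my own assumptions, not anything from the DeepSpeech docs:

```python
# Sketch: turn raw text into the two inputs mentioned above --
# a one-sentence-per-line corpus for lmplz, and a vocab.txt for generate_trie.
# File names and the normalization rule are illustrative assumptions.
import re

def prepare_lm_inputs(raw_text, corpus_path="corpus.txt", vocab_path="vocab.txt"):
    words = set()
    with open(corpus_path, "w") as corpus:
        for line in raw_text.splitlines():
            # lowercase, keep only letters/apostrophes
            tokens = re.findall(r"[a-z']+", line.lower())
            if not tokens:
                continue
            corpus.write(" ".join(tokens) + "\n")
            words.update(tokens)
    with open(vocab_path, "w") as vocab:
        vocab.write("\n".join(sorted(words)) + "\n")
    return sorted(words)
```

lmplz can then read the corpus (e.g. `lmplz --order 5 < corpus.txt > lm.arpa`), and vocab.txt goes to generate_trie as the docs describe.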

What I noticed is that there are a lot of low-quality words and sentences in librispeech-lm-norm.txt (which presumably end up in vocab.txt). I see in master that vocab.txt is no longer used, so that part won’t be a concern going forward, but the language model seems like it’ll contain a huge number of odd words, and presumably the sequences from some of the weirder sentences will throw it off too. In a few cases I see non-English sentences in there as well (they looked like Dutch and German).

In June, there was mention in the “LM + TRIE performance” thread of new material for the language model being worked on. Has that updated material been used in the lm.binary that’s being distributed?

I ask because, whilst the distributed language model often helps, it occasionally throws out very weird words, and I suspect that may be at least partially explained by the text quality.

I’d offer to help clean it, but the size makes that impractical (it’s about 40 million lines!). Maybe if the new material is from a clean source and simply hasn’t been released yet, this will be a problem that goes away, but it would be handy to know a bit about the status (if you can share any details yet). Thank you!

I’m currently experimenting with new language models with a limited vocabulary: the 10k, 20k, 30k, 40k, or 50k most common words from librispeech-lm-norm.txt.

Using this limited vocabulary should throw out the rare words in librispeech-lm-norm.txt that appear only once or twice, and thus address this problem, but we have to run the benchmarks to be sure.


@kdavis do you have any updates on creating an LM with a limited vocabulary that would improve recognition quality?

Not really. We have a company-wide meeting this week, so I’ve not had much time to actually work :slight_smile: I’ll know more next week.