Thanks for all the work going into this amazing project!
I’ve been looking at building a custom LM, following https://github.com/mozilla/DeepSpeech/tree/master/data/lm, and have managed fairly well so far using my own data (i.e. as the equivalent of the text input to lmplz and as the vocab.txt for generate_trie).
What I noticed is that there are a lot of low-quality words and sentences in librispeech-lm-norm.txt (which presumably end up in vocab.txt). I see in master that vocab.txt is no longer used, so that part won’t be a concern going forward, but the language model still seems likely to contain a huge number of odd words, and presumably the word sequences from some of the weirder sentences will throw it off too. In a few cases I also see non-English sentences in there (they looked like Dutch and German).
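To get a rough sense of how much of this there is, something like the following could be run over the file. It’s only a minimal sketch: the function names, the `known_words` set and the 0.5 threshold are my own assumptions for illustration, not anything from the DeepSpeech tooling, and it lower-cases everything before comparing.

```python
from collections import Counter


def rare_words(lines, max_count=1):
    """Return words seen at most `max_count` times (candidates for odd words)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return sorted(w for w, c in counts.items() if c <= max_count)


def flag_lines(lines, known_words, max_unknown_ratio=0.5):
    """Return lines where more than `max_unknown_ratio` of the words are
    missing from `known_words` (a crude hint for non-English sentences)."""
    flagged = []
    for line in lines:
        words = line.lower().split()
        if not words:
            continue
        unknown = sum(1 for w in words if w not in known_words)
        if unknown / len(words) > max_unknown_ratio:
            flagged.append(line.rstrip("\n"))
    return flagged


if __name__ == "__main__":
    sample = ["the cat sat on the mat", "de kat zat op de mat"]
    known = {"the", "cat", "sat", "on", "mat"}  # hypothetical word list
    print(flag_lines(sample, known))
```

Both functions accept any iterable of lines, so an open file handle can be passed in and the 40m lines never need to be held in memory at once (the word counts do, though).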
In June, there was mention in the “LM + TRIE performance” thread here of new material being worked on for the language model. Has that updated material been used in the lm.binary that’s being distributed?
I ask because when using the distributed language model, whilst it often helps, occasionally it throws out very weird words, and I suspect that may be at least partially explained by the text quality.
I’d offer to help clean it, but the size makes that impractical (it’s about 40m lines!). If the new material is from a clean source and simply hasn’t been released yet, then this may be a problem that goes away, but it would be handy to know a bit about the status, if you can share any details yet. Thank you!