Fine tuning the language model

Hi,
The current speech recognition output contains a few really long sequences of characters. I know that this is a known bug that is currently being worked on.
I also know that the issue has been traced to some words not being in the vocabulary. If I have a few transcripts that are representative of my text, what is the best way to pass this information to the system?

  1. Build an lm.binary and a trie from my texts + Common Voice texts + other text from the internet (e.g. Wikipedia).
  2. Spell-correct the output of the transcription using a spell correction algorithm.
    (Any recommendations for libraries that do spell correction well? Underneath, this is yet another language model, so I am guessing that the right thing to do is to fix the original language model.)
    I have a feeling that the acoustic model is working well and is not to blame here. When I read the parts of the transcripts that are particularly bad, the text there does sound like the true words being spoken.
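
For anyone wanting to check the out-of-vocabulary theory before rebuilding anything, here is a minimal sketch that lists which words in your own transcripts never occur in the text the language model was built from. It assumes plain-text input with one sentence per line and an English alphabet of a-z plus apostrophe; the function names `tokenize` and `oov_words` are hypothetical and not part of any project API.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters and apostrophes.
    # Assumption: this roughly matches the alphabet the model uses.
    return re.findall(r"[a-z']+", text.lower())


def oov_words(transcript_lines, vocabulary_lines):
    """Count transcript words that never appear in the vocabulary text."""
    vocab = set()
    for line in vocabulary_lines:
        vocab.update(tokenize(line))
    counts = Counter()
    for line in transcript_lines:
        for word in tokenize(line):
            if word not in vocab:
                counts[word] += 1
    return counts


if __name__ == "__main__":
    # Toy data standing in for the LM source text and your own transcripts.
    lm_text = ["the cat sat on the mat", "speech recognition is fun"]
    mine = ["the kenlm toolkit builds the trie", "the cat likes kenlm"]
    for word, count in oov_words(mine, lm_text).most_common():
        print(word, count)
```

If the words that come out garbled in the transcription show up in this list, that supports adding your own text to the language model rather than post-correcting.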

You should be able to go down that road. Take a look at data/lm/README.md, where we document how to reproduce the shipped lm.binary; you should then be able to augment the text file generated from LibriSpeech with your own data :slight_smile:
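
As a rough illustration of the "augment the text file" step, here is a minimal sketch that normalizes two sets of lines and concatenates them into one corpus. Everything in it is an assumption made for illustration: the normalization rule (lowercase, a-z, apostrophe, space), the `own_repeat` knob for repeating your own data so it carries more weight, and the function names. The actual build commands are the ones documented in data/lm/README.md.

```python
import re


def normalize(line):
    # Lowercase and replace anything outside a-z, apostrophe and space.
    # Assumption: this matches the alphabet the acoustic model emits.
    line = re.sub(r"[^a-z' ]+", " ", line.lower())
    return " ".join(line.split())


def merge_corpora(base_lines, own_lines, own_repeat=1):
    """Yield normalized base lines, then own lines repeated own_repeat times."""
    for raw in base_lines:
        line = normalize(raw)
        if line:
            yield line
    own = [normalize(raw) for raw in own_lines]
    for _ in range(own_repeat):
        for line in own:
            if line:
                yield line


if __name__ == "__main__":
    # Toy data; in practice these would be iterated from open text files.
    base = ["The cat sat on the mat.", "Hello, world!"]
    mine = ["Fine-tuning the language model"]
    for line in merge_corpora(base, mine, own_repeat=2):
        print(line)
```

The merged lines would then be written to a single text file and used as the input to the language model build in place of the original text.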

@lissyx Thanks for the response.

Any pointers on (2)? I’d like to see if there are any spell correction routes that I can try.
https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error-rate/ talks about this approach:
“We decided to work around the problem by building something like a spell checker instead: go through the transcription and see if there are any small modifications we can make that increase the likelihood of that transcription being valid English, according to the language model.” - Any idea what was used here?
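
To make the quoted idea concrete, here is a minimal sketch of "small modifications that increase the likelihood according to the language model". It is not what the blog post's authors used (the post does not say); it swaps in a toy add-one-smoothed bigram model and single-character edits, and the names `BigramLM`, `edits1` and `correct` are hypothetical. A real setup would score candidates with the actual language model instead.

```python
import math
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz'"


class BigramLM:
    """Toy add-one-smoothed bigram model; a stand-in for a real LM."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            words = ["<s>"] + sentence.split()
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
        self.vocab = set(self.unigrams) - {"<s>"}

    def logprob(self, prev, word):
        size = len(self.vocab) + 1  # +1 reserves mass for unseen words
        return math.log((self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + size))

    def score(self, words):
        prev, total = "<s>", 0.0
        for word in words:
            total += self.logprob(prev, word)
            prev = word
        return total


def edits1(word):
    """All strings one delete, swap, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)


def correct(sentence, lm):
    """Greedily replace each word with the in-vocabulary edit that scores best."""
    words = sentence.split()
    for i, word in enumerate(words):
        # The original word comes first, so ties keep it unchanged.
        candidates = [word] + sorted(c for c in edits1(word) if c in lm.vocab)
        words[i] = max(candidates, key=lambda c: lm.score(words[:i] + [c] + words[i + 1:]))
    return " ".join(words)


if __name__ == "__main__":
    lm = BigramLM(["the cat sat on the mat", "the dog sat on the rug"])
    print(correct("the dog sat on teh rug", lm))
```

Note that this only handles character-level slips inside a word. It does nothing about missing spaces, so it would not by itself repair the very long character sequences described at the top of the thread.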

Have all the other issues related to the spacing errors (other than those related to the language model) been fixed?

Thanks,
Srikar

That was a language model as well.

I don’t think there are other issues except those related to the language model …