Customizing the language model

Hello,
I’d like to improve the WER for my use cases by adding domain-specific phrases and names (e.g. for technical text, adding words and phrases like “Node.js”, “GPU”, “JavaScript”).

The rest of the text is in general English so I’d like to leverage the existing models.

The first thing I’d like to try is to use output_graph.pb as is and adapt the language model. So far, I have only seen the language model in its binary form in the DeepSpeech repos, but not in ARPA format or the original text file from which the model was generated - are these available somewhere? Alternatively, has anyone tried to extend the binary language model?

If anyone can see an easier way to customize the inference with a list of domain specific words/phrases, I’d appreciate your ideas.


Thanks! I think that the language model creation is supposed to be documented, but maybe we forgot to do that. In the meantime, Vincent shared with us his steps to produce a robot-dedicated speech recognition model that covers this: TUTORIAL : How I trained a specific french model to control my robot

I do think we have the vocabulary file in the repo, in data/lm/vocab.txt, and you can re-generate the trie and lm.binary files.
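For reference (untested as written, and the generate_trie arguments have changed between versions, so check its usage output), the rebuild looks roughly like this with KenLM and the native client tools:

# build a 5-gram ARPA model from the vocabulary, then convert it to the binary format
lmplz --order 5 --text data/lm/vocab.txt --arpa lm.arpa
build_binary lm.arpa lm.binary
# regenerate the trie with the native client's generate_trie (example arguments)
./generate_trie data/alphabet.txt lm.binary data/lm/vocab.txt trie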

Regarding training on top of the current model to add more words, this is definitely something that should work; the only limitation is that you cannot do it with the current output_graph.pb file, since it is a frozen model and so cannot be used for training. You would also need the checkpoints to resume training. We are looking into a dot release soon that would likely include them, so you should be able to do that soon, I hope :slight_smile:

Hi, yv001

Well, the DeepSpeech team, when creating their model, trained on a lot of WAVs/sentences,
so each US alphabet letter is very well learnt (with the alphabet file)!

Now, we must work on the LM/trie for a better WER.

Here, you’d like to ‘improve’ your specific words, but the first question is:
does the model KNOW the words you want?!

Why not insert into the vocabulary anything like:

  • start a javascript program
  • launch javascript application
  • what is javascript
  • stop javascript app

Try to cover all the wanted/possible queries…

Then build new LM and trie files.
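For example (just a sketch; my_domain_sentences.txt stands for whatever file holds your new phrases):

# append the domain-specific sentences to the vocabulary, then rebuild as in the tutorial
cat my_domain_sentences.txt >> vocab.txt
lmplz --order 5 --text vocab.txt --arpa lm.arpa
build_binary lm.arpa lm.binary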

You should then get a noticeably better WER on all the javascript sequences you might ask about!

Hope it helps!

thanks both for the replies,
@lissyx I’d missed vocab.txt in the repo, so thanks for pointing me towards it. That’s the one I’ll try to use for building the extended language model. Checkpoints of the model would indeed be very useful for transfer learning, so I’ll be looking forward to the next release.

@elpimous_robot Thanks for the suggestion, that’s exactly where I was heading when looking for the vocabulary file: appending the domain-specific phrases to it and hopefully improving WER on previously unrecognized words. I’ve used this approach today (following your tutorial, so thanks again) to build a new lm and trie, so I’ll run some tests to see whether changing just the language model is enough or whether I’ll need to retrain the TensorFlow model too (hopefully not).

I’ve tried to recreate the data/lm/lm.binary model from data/lm/vocab.txt using

lmplz --text vocab.txt --arpa lm.arpa --o 5 -S 50%

and then transform it into binary format with

build_binary -T -s lm.arpa lm.binary

but the created lm.binary is only 125 MB, as opposed to 1.5 GB for the released DeepSpeech model.
Was a different vocabulary used to create the model, or were the lmplz params different?

I think that the DeepSpeech US model is frozen… you can’t modify it (for now).
Or rebuild it entirely from scratch (good luck, LOL).

What did you change in vocab.txt? Did you just ADD your sentences at the end of the existing ones?
Strange!
Did you try inference with your new LM, trie, and sentences? Did it work?

I just added my phrases and their combinations to the end of vocab.txt and created a KenLM model using the commands above (the order of the model I used was 5). It improved the recognition of my phrases a little bit, but it’s still not perfect.

I’ll need to run the evaluation on more data to be sure how much improvement the updated language model provides and how many extra sentences in vocab.txt are needed to increase the probabilities of my phrases sufficiently.

An alternative to generating a lot of sentences for vocab.txt would be tweaking the probabilities in the ARPA file directly, but I haven’t read the format spec yet, so I don’t know whether there are constraints, e.g. that the probabilities of all entries have to sum to 1 or something like that.
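If I read the KenLM docs correctly, an ARPA file stores log10 probabilities (plus back-off weights) per n-gram, so the entries look roughly like this (values below are made up, just to illustrate what I would tweak):

\1-grams:
-2.845986  javascript  -0.312700
-3.201000  gpu  -0.290100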

Once I know more, I’ll post it here.

In the meantime, if there were a way in DeepSpeech to inject high-probability phrases into the recognition call directly, that would resolve the problem much more elegantly. @lissyx are there any plans for this on the roadmap?

I don’t think it’s something we discussed yet, but if you want to experiment you are welcome :).

Hi yv001,
did you recreate a trie file too?
In client.py, for example, you can pass the trie file to help the probabilities.
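Something like this (untested on my side, and the argument order may differ depending on your version):

python client.py output_graph.pb my_audio.wav alphabet.txt lm.binary trie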

Yup, I’ve rebuilt the trie as well using the native client script, so that shouldn’t be the issue.

Changing the parameters LM_WEIGHT and VALID_WORD_COUNT_WEIGHT (in the node.js bindings) influences the result a lot, but tuning them before I have a lot of training data available would be a lottery.

So, just to summarize the results of the IRC conversations for others who might be interested in extending the existing language model:

  • lm.binary from the release (the 1.5 GB version) was generated from an unpublished text file (the source licenses prohibit publishing the full text that was used)
  • whether new data can be merged into the existing lm.binary is unclear at the moment; the KenLM maintainers will need to be contacted

Hi, maybe I missed this, but on what corpus did you train the language model? Was it data from the LibriSpeech train set, internet text, spoken text? If you tried several different types of corpora, did you see significant differences on LibriSpeech?

It was a combination of the LibriSpeech train set, Fisher train, Switchboard train, plus some other sources.

For release 0.2.0, we are currently creating and training on a corpus that is licensed for public release.