Adding custom words to language model

So I’d like to extend the language model with custom words that do not appear in the default one or appear only rarely, e.g. imagine “mozilla”, “google” and “ibm” appear in many of the samples I’d like to transcribe.

The question is how much text I’d need to add on top of the texts used for the default DeepSpeech language model.

I thought the following method might work, but I’d like to see if someone has a better way of doing that:

  1. Collect a sample text with the custom words and estimate the probability of each custom word in the sample text, e.g.
    sample_mozilla_probability = #mozilla / #sample_words
  2. I would like my custom words to have the same probability in the final LM training set (final = deepspeech + new text) as they have in my custom training set, so I calculate the number of occurrences needed in the new text as
    mozilla_needed_count = sample_mozilla_probability * #words_deepspeech_text
  3. To get my final language model training set, I take my sample phrases containing mozilla, repeat them enough times to reach mozilla_needed_count, and append them to the original deepspeech text.

I ignore the fact that the word mozilla may already be in the deepspeech text, but accounting for that should be as easy as counting #mozilla in the deepspeech text and subtracting it from mozilla_needed_count.
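For anyone who wants to try the numbers, here is a minimal Python sketch of the estimate described above. The function and parameter names are made up for illustration, and all counts are placeholders. One thing the sketch makes visible: appending text also grows the total word count, so repeating the whole sample the computed number of times lands below the target probability rather than on it.

```python
import math


def repetitions_needed(sample_hits, sample_words, base_words, base_hits=0):
    """Apply the estimate from the post (hypothetical helper, not part of DeepSpeech).

    sample_hits  -- occurrences of the custom word in the sample text
    sample_words -- total words in the sample text
    base_words   -- total words in the original LM training text
    base_hits    -- occurrences of the custom word already in the original text
    """
    # needed_count = sample_probability * #words_in_base_text, minus existing hits
    target = math.ceil(sample_hits * base_words / sample_words)
    needed = max(0, target - base_hits)
    # How many times the whole sample must be repeated to reach that count
    reps = math.ceil(needed / sample_hits)
    return needed, reps


def resulting_probability(sample_hits, sample_words, base_words, base_hits, reps):
    """Probability of the word in the merged text after appending the sample reps times."""
    total_hits = base_hits + reps * sample_hits
    total_words = base_words + reps * sample_words
    return total_hits / total_words


# Placeholder numbers, purely illustrative
needed, reps = repetitions_needed(5, 1000, 1_000_000, base_hits=100)
final_p = resulting_probability(5, 1000, 1_000_000, 100, reps)
print(needed, reps, final_p)
```

If the exact target probability matters, the count would have to be solved for with the larger denominator in mind; whether that precision matters in practice is something only an experiment will tell.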

Does that sound reasonable?

Have you tried just adding those words into the existing language model, as a baseline?

How is that done? I’d like to add “zoom” to the US English model as it never recognizes it!

Have you read the docs about building the scorer?

Briefly, it looks like a lengthy process and I wanted to know if there was an alternative. Did you try getting it to recognize “zoom in”? It gives “some in”.

Have you tried the new word boost feature of the API?

No, I hadn’t found that yet. Sounds interesting; is there a guidance page or document?

It’s part of the API, e.g., for C it’s https://deepspeech.readthedocs.io/en/v0.9.3/C-API.html#_CPPv413DS_AddHotWordP10ModelStatePKcf

Looks good, so that would allow me to move “zoom” up above “some” in the selection process? I’ve only just started today and haven’t yet found the guidance for using the C++ or C binding with MS Visual Studio, so I’m using Python. I’ll add an addHotWord(word, boost) call after creating the model, but there is no info on the boost float: is 1.0 normal, and what is the max?

Is there any info on using addHotWord and its float values? An overall explanation? Can it be used to disable words? I tried to add “node” but all it gives is “note”. Setting “note” to .1 doesn’t help… :confused:

No, you have to perform your own experiments for your own application.
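One way to run such an experiment is to sweep a few boost values over the same audio and compare the transcripts. The sketch below is only that, a sketch: sweep_boosts is a hypothetical helper, the boost values are arbitrary placeholders, and it assumes a model object exposing addHotWord, clearHotWords and stt as in the DeepSpeech 0.9.x Python bindings.

```python
def sweep_boosts(model, audio, word, boosts):
    """Transcribe the same audio once per boost value (hypothetical helper).

    model  -- object with clearHotWords(), addHotWord(word, boost) and stt(audio),
              e.g. a deepspeech.Model with an external scorer enabled
    audio  -- audio buffer in whatever form model.stt() expects
    word   -- the hot word to boost
    boosts -- iterable of float boost values to try
    """
    results = {}
    for boost in boosts:
        model.clearHotWords()          # start each run from a clean state
        model.addHotWord(word, boost)  # apply only the boost under test
        results[boost] = model.stt(audio)
    model.clearHotWords()              # leave the model as we found it
    return results


# Assumed usage with the real bindings (file names are placeholders):
#   from deepspeech import Model
#   model = Model("deepspeech-0.9.3-models.pbmm")
#   model.enableExternalScorer("deepspeech-0.9.3-models.scorer")
#   for boost, text in sweep_boosts(model, audio, "zoom", [0.0, 5.0, 10.0]).items():
#       print(boost, text)
```

Running this over a handful of recordings rather than one should give a better idea of which value helps without distorting everything else.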

Okay, I seem to have found useful values now. DeepSpeech seems really excellent, hope you can get a team together to take it forward after all the layoffs.

Hi again, I’m finding that non-typical names, such as lunar craters, are not recognized even when boosted by AddHotWord. I searched for “adding those words into the existing language model, as a baseline” but the search engines return 0 results. I don’t have the time to go through the source code and unravel DeepSpeech, so I would like to ask if you or anyone knows of a clear guide to “adding custom words to an existing model”. Thanks…

From my understanding, hot words will only work for vocabulary that already exists in the scorer. You will most likely have to train your own scorer (pretty much a language model plus vocab and alphabet).
The documentation is actually pretty straightforward and can be found here:

https://deepspeech.readthedocs.io/en/latest/Scorer.html#reproducing-our-external-scorer

So if you want to keep the big general corpus for open transcription but also add your custom words, you will have to create a custom corpus of sentences for your special vocabulary, merge it with the original corpus, and then recreate the scorer from the merged corpus.
If you only need transcription to work for a small, domain-specific vocabulary and set of sentences, you can create a scorer from just the sentences and vocabulary you actually need.
I actually do this, but I don’t use the Python script provided. I use KenLM directly to create the LM and trie, and then use the generate_scorer_package binary.

Johannes
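The "custom corpus merged with the original corpus" step can be sketched in a few lines of Python. Everything here is an assumption for illustration: the helper names, the sentence templates, and the idea of repeating the custom sentences to give them more weight are not taken from the DeepSpeech docs.

```python
def build_custom_sentences(names, templates):
    """Expand each name into every sentence template (hypothetical helper)."""
    return [t.format(name=n).lower() for n in names for t in templates]


def merge_corpus(base_lines, custom_sentences, repeat=1):
    """Return the base corpus followed by the custom sentences.

    base_lines       -- iterable of lines from the original corpus
    custom_sentences -- list of sentences containing the custom vocabulary
    repeat           -- how many copies of the custom sentences to append
    """
    # Normalize the base text: one lowercase sentence per line, no blank lines
    merged = [line.strip().lower() for line in base_lines if line.strip()]
    merged.extend(custom_sentences * repeat)
    return merged


# Illustrative data only; a real run would read the crater list from a file
craters = ["Tycho", "Copernicus"]
templates = ["fly to {name}", "show me {name} crater"]
custom = build_custom_sentences(craters, templates)
corpus = merge_corpus(["Hello world\n", "\n"], custom, repeat=2)
print(len(corpus), corpus[:3])
```

The merged list would then be written out one sentence per line and used as the input text for building the language model.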

Thanks for that, but being totally new to DS that sounds like ancient Egyptian and a deep learning curve. Our app needs the full language model, plus all crater names, so students can ask to fly to any listed crater. I have a text file of all 2500 craters, but no idea how to add it.

The link you posted assumes the reader knows the input format for all the mentioned tools, so it raises as many questions as it answers.
Given that many new users will want to add words, streamlining this process, or at least providing a clear YouTube video, will be critical to capturing new users.

Unless people clearly state what they need, it is also hard for us to know what is complicated and what is not.

It’s down to a text file, and a few commands to run.

Maybe you could show everyone on YouTube how simple it is to add custom words; it would be good for the project and users, I’m sure.

Maybe I don’t have the time and skill set to make videos on YouTube.

I mean, seriously, please explain to me what is complicated:

  • a text file, with sentences describing your language / vocab
  • run generate_lm.py to build it
  • run generate_scorer_package to make it suitable for deepspeech
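For readers who want to see the last two steps spelled out, here is a rough Python sketch that only builds the two command lines. The flags are the ones I recall from the example in the Scorer documentation linked earlier in the thread, and the option values and the alpha/beta defaults are copied from that example as I remember it, so please check them against the docs for your version. All paths are placeholders and the two helper functions are hypothetical.

```python
import os


def generate_lm_command(corpus_path, output_dir, kenlm_bins, top_k=500000):
    """Build the generate_lm.py invocation (values assumed from the docs example)."""
    return [
        "python3", "generate_lm.py",
        "--input_txt", corpus_path,      # the merged text file, one sentence per line
        "--output_dir", output_dir,
        "--top_k", str(top_k),           # keep the top_k most frequent words
        "--kenlm_bins", kenlm_bins,      # directory containing the KenLM binaries
        "--arpa_order", "5",
        "--max_arpa_memory", "85%",
        "--arpa_prune", "0|0|1",
        "--binary_a_bits", "255",
        "--binary_q_bits", "8",
        "--binary_type", "trie",
    ]


def generate_scorer_command(alphabet_path, output_dir, package_path,
                            top_k=500000,
                            alpha=0.931289039105002, beta=1.1834137581510284):
    """Build the generate_scorer_package invocation from the files produced above."""
    return [
        "./generate_scorer_package",
        "--alphabet", alphabet_path,
        "--lm", os.path.join(output_dir, "lm.binary"),
        "--vocab", os.path.join(output_dir, "vocab-{}.txt".format(top_k)),
        "--package", package_path,
        "--default_alpha", str(alpha),
        "--default_beta", str(beta),
    ]


lm_cmd = generate_lm_command("merged_corpus.txt", "out", "kenlm/build/bin")
scorer_cmd = generate_scorer_command("alphabet.txt", "out", "custom.scorer")
print(" ".join(lm_cmd))
print(" ".join(scorer_cmd))
# To actually run them: subprocess.run(lm_cmd, check=True), then the same for scorer_cmd
```

With a very small corpus KenLM may complain that it cannot calculate the discounts; generate_lm.py has a --discount_fallback flag intended for that case, as far as I know.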