Adding custom words to language model

yv001 · January 31, 2019, 10:46am

So I’d like to extend the language model with custom words that do not appear in the default one, or appear only rarely. For example, imagine “mozilla”, “google”, and “ibm” appear in many of the samples I’d like to transcribe.

The question is how much text I’d need to add on top of the texts used for the default DeepSpeech language model.

I thought the following method might work, but I’d like to see if someone has a better way of doing that:

Collect a sample text containing the custom words and estimate the probability of each custom word in that sample, e.g.
sample_mozilla_probability = #mozilla / #sample_words
I would like my custom words to have the same probability in the final LM training set (final = DeepSpeech text + new text) as they have in my custom sample, so I calculate the number of occurrences needed in the new text as

mozilla_needed_count = sample_mozilla_probability * #words_deepspeech_text

So to get my final language model training set, I can use my sample phrases containing mozilla and repeat them enough times to get mozilla_needed_count and append to the original deepspeech text.

I ignore the fact that the word mozilla may already be in the DeepSpeech text, but accounting for that should be as easy as counting #mozilla in the DeepSpeech text and subtracting it from mozilla_needed_count.

Does that sound reasonable?
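
The estimate described in this post can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name and the numbers in the example are hypothetical, not taken from the thread, and it keeps the post's simplification of ignoring that the appended text also increases the total word count.

```python
def needed_occurrences(sample_word_count, sample_total_words,
                       base_total_words, base_word_count=0):
    """Estimate how many times a custom word must appear in the added text.

    sample_word_count:  occurrences of the word in the custom sample (#mozilla)
    sample_total_words: total words in the custom sample (#sample_words)
    base_total_words:   total words in the original LM text (#words_deepspeech_text)
    base_word_count:    occurrences already present in the original LM text
    """
    if sample_total_words <= 0:
        raise ValueError("sample_total_words must be positive")
    # Probability of the word in the custom sample.
    target_probability = sample_word_count / sample_total_words
    # Scale to the size of the base text, then subtract what is already there.
    needed = target_probability * base_total_words - base_word_count
    # Never return a negative count: the word may already be frequent enough.
    return max(0, round(needed))


# Hypothetical numbers: 50 hits in a 10,000-word sample, and a 1,000,000-word
# base text that already contains the word 200 times.
print(needed_occurrences(50, 10_000, 1_000_000, 200))
```

The sample phrases would then be repeated until the word has been added roughly that many times.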

lissyx · March 4, 2019, 12:33pm

Have you tried just adding those words into the existing language model, as a baseline ?

Rob_Christopher · February 2, 2021, 12:16pm

How is that done? I’d like to add “zoom” to the US English model as it never recognizes it!

lissyx · February 2, 2021, 1:04pm

Have you read the docs about building the scorer ?

Rob_Christopher · February 2, 2021, 2:31pm

Briefly. It looks like a lengthy process, and I wanted to know if there was an alternative. Did you try getting it to recognize “zoom in”? It’s giving “some in”.

lissyx · February 2, 2021, 2:35pm

Have you tried the new word boost feature of the API ?

Rob_Christopher · February 2, 2021, 3:23pm

No, I hadn’t found that yet, sounds interesting, is there a guidance page or document?

lissyx · February 2, 2021, 3:38pm

It’s part of the API, e.g., for C it’s https://deepspeech.readthedocs.io/en/v0.9.3/C-API.html#_CPPv413DS_AddHotWordP10ModelStatePKcf

Rob_Christopher · February 2, 2021, 10:04pm

Looks good, so that would allow me to move “zoom” up above “some” in the selection process? I’ve only just started today, so I haven’t yet found the guidance for using the C++ or C bindings with MS Visual Studio, hence I’m using Python. So I’ll call addHotWord(word, boost) after creating the model, but the boost float has no documentation: is 1.0 a normal value, and what is the max?

Rob_Christopher · February 3, 2021, 5:16pm

Is there any info on using addHotWord and its float values? An overall explanation? Can it be used to disable words? I tried to add “node” but all it gives is “note”. Setting “note” to 0.1 doesn’t help…
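
For readers trying the same thing, here is a minimal sketch of registering hot words from Python. It assumes the 0.9.x Python binding, where the model object has an `addHotWord(word, boost)` method; the helper name `apply_hot_words` and the boost values are hypothetical, and suitable values have to be found by experiment. According to the API documentation for the boost parameter, a positive value increases and a negative value reduces the chance of the word occurring.

```python
def apply_hot_words(model, boosts):
    """Register each word in `boosts` as a hot word on a DeepSpeech-style model.

    `model` is assumed to expose addHotWord(word, boost), as the 0.9.x Python
    binding does. `boosts` maps a word to a float boost value.
    Returns the list of words that were registered, lowercased.
    """
    applied = []
    for word, boost in boosts.items():
        normalized = word.strip().lower()  # the English scorer vocabulary is lowercase
        if not normalized:
            continue
        model.addHotWord(normalized, float(boost))
        applied.append(normalized)
    return applied


# Usage sketch (needs the deepspeech package and model files; the boost
# values below are placeholders, not recommendations):
#
#   from deepspeech import Model
#   model = Model("deepspeech-0.9.3-models.pbmm")
#   model.enableExternalScorer("deepspeech-0.9.3-models.scorer")
#   apply_hot_words(model, {"zoom": 7.5, "some": -2.0})
```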

lissyx · February 3, 2021, 7:42pm

No, you have to perform your own experiments for your own application.

Rob_Christopher · February 3, 2021, 7:44pm

Okay, I seem to have found useful values now. DeepSpeech seems really excellent, hope you can get a team together to take it forward after all the layoffs.

Rob_Christopher · February 9, 2021, 1:35pm

Hi again, I’m finding that non-typical names, such as lunar craters, are not recognized even when boosted by AddHotWord. I searched for “adding those words into the existing language model, as a baseline” but there are 0 references on the search engines. I don’t have the time to go through the source code and unravel DeepSpeech, so I would like to ask if you or anyone knows of a clear guide to “adding custom words to an existing model”. Thanks…

JGKK · February 9, 2021, 2:13pm

From my understanding hotwords will only work for existing vocabulary from the scorer. You will most likely have to train your own scorer (pretty much a language model plus vocab and alphabet).
The documentation is actually pretty straightforward and can be found here:

https://deepspeech.readthedocs.io/en/latest/Scorer.html#reproducing-our-external-scorer

So if you want to keep the general big corpus for open transcription but want to add your custom words, you are going to have to create a custom corpus of sentences for your special vocabulary, merge it with the original corpus, and then recreate the scorer from that new merged corpus.
If you actually only need the transcription to work for a small, domain-specific set of words and sentences, you can create a scorer just from the sentences and vocabulary you actually need.
I actually do this but don’t use the Python script provided. I use KenLM directly to create the LM and trie, and then use the generate_scorer_package binary.

Johannes
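
The merge step described in this post could look roughly like the sketch below. It is an assumption-laden illustration, not the project's own tooling: `merge_corpora` is a hypothetical helper, both corpora are assumed to be plain UTF-8 text with one sentence per line, and the `repeat` factor is just one simple way to give the custom sentences more weight.

```python
def merge_corpora(base_path, custom_sentences, out_path, repeat=1):
    """Write the base corpus followed by the custom sentences to out_path.

    base_path:        text file with one sentence per line (the original corpus)
    custom_sentences: iterable of extra sentences containing the new vocabulary
    repeat:           how many times to append the custom sentences
    Returns the number of lines written.
    """
    custom = [s.strip().lower() for s in custom_sentences]
    custom = [s for s in custom if s]  # drop empty entries
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Stream the base corpus so a large file is never held in memory.
        with open(base_path, encoding="utf-8") as base:
            for line in base:
                line = line.strip()
                if line:
                    out.write(line + "\n")
                    written += 1
        # Append the custom sentences, repeated to raise their frequency.
        for _ in range(repeat):
            for sentence in custom:
                out.write(sentence + "\n")
                written += 1
    return written
```

The merged file is then what the language model and scorer would be rebuilt from.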

Rob_Christopher · February 9, 2021, 2:24pm

Thanks for that, but being totally new to DS that sounds like ancient Egyptian and a deep learning curve. Our app needs the full language model, plus all crater names, so students can ask to fly to any listed crater. I have a text file of all 2500 craters, but no idea how to add it.

Rob_Christopher · February 9, 2021, 2:30pm

The link you posted assumes the reader knows the input format for all the mentioned tools, so it raises as many questions as it answers.
Given that many new users will want to add words, streamlining this process, or at least providing a clear YouTube video, will be critical to capturing new users.

lissyx · February 9, 2021, 4:09pm

Unless people clearly state what they need, it is also hard for us to know what is complicated and what is not.

It’s down to a text file, and a few commands to run.

Rob_Christopher · February 9, 2021, 4:15pm

Maybe you could show everyone on YouTube how simple it is to add custom words, would be good for the project and users I’m sure

lissyx · February 9, 2021, 4:18pm

Maybe I don’t have time and skill sets to make videos on youtube.

lissyx · February 9, 2021, 4:19pm

I mean, seriously, please explain to me what is complicated:

a text file, with sentences describing your language / vocab
run generate_lm.py to build it
run generate_scorer_package to make it suitable for deepspeech
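
As an illustration of the first step (the text file), here is a sketch that turns a list of crater names into sentences. The "fly to" phrasing comes from the use case described earlier in the thread; the helper name, the second template, and the assumption that the English alphabet consists of lowercase a-z, apostrophe and space are mine, so check them against the alphabet.txt you actually use.

```python
import re
import unicodedata


def crater_sentences(names, templates=("fly to {name}", "{name}")):
    """Build lowercase training sentences from a list of names.

    Accented letters are transliterated to plain ASCII and any character
    outside a-z, apostrophe and space is replaced by a space, on the
    assumption that the target alphabet only contains those characters.
    """
    sentences = []
    for raw in names:
        # Split accented letters into base letter + combining mark, drop the mark.
        ascii_name = (unicodedata.normalize("NFKD", raw)
                      .encode("ascii", "ignore").decode("ascii"))
        name = re.sub(r"[^a-z' ]+", " ", ascii_name.lower())
        name = " ".join(name.split())  # collapse repeated spaces
        if not name:
            continue
        for template in templates:
            sentences.append(template.format(name=name))
    return sentences


print(crater_sentences(["Tycho", "Schrödinger"]))
```

Writing these sentences one per line, alone or appended to a larger corpus, gives the text file that generate_lm.py takes as input.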
