I’d like to extend the language model with custom words that do not appear in the default one, or appear only rarely. For example, imagine that “mozilla”, “google”, and “ibm” appear in many of the samples I’d like to transcribe.
The question is how much text I’d need to add on top of the texts used for the default DeepSpeech language model.
I thought the following method might work, but I’d like to see if someone has a better way of doing that:
Collect a sample text containing the custom words and estimate the probability of each custom word in that sample, e.g.
sample_mozilla_probability = #mozilla / #sample_words
I would like my custom words to have the same probability in the final LM training set (final = DeepSpeech text + new text) as they have in my custom sample, so I calculate the number of occurrences needed in the new text, mozilla_needed_count, from that target probability.
To build the final language model training set, I can take my sample phrases containing mozilla, repeat them enough times to reach mozilla_needed_count, and append them to the original DeepSpeech text.
This ignores the fact that the word mozilla may already appear in the DeepSpeech text, but accounting for that should be as easy as counting #mozilla in the DeepSpeech text and subtracting it from mozilla_needed_count.
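To make the arithmetic above concrete, here is a minimal sketch of one way to work out the repetition count. It is not necessarily the exact formula from the post: it also accounts for the fact that every repeated phrase adds ordinary words to the corpus, and for occurrences already present in the base text. The function name and all parameters are hypothetical.

```python
import math


def repetitions_needed(target_prob, base_total_words, base_word_count,
                       phrase_words, phrase_word_count):
    """Smallest number of times to repeat the sample phrases so that the
    custom word reaches `target_prob` in the combined corpus.

    Solves (base_word_count + r * phrase_word_count)
           / (base_total_words + r * phrase_words) >= target_prob  for r.
    """
    # The phrases must contain the word more densely than the target,
    # otherwise no amount of repetition can reach it.
    if phrase_word_count / phrase_words <= target_prob:
        raise ValueError("sample phrases are not dense enough in the custom word")

    # Occurrences still missing if only the base corpus were used.
    deficit = target_prob * base_total_words - base_word_count
    if deficit <= 0:
        return 0  # the base corpus already meets the target

    # Net gain per repetition: new occurrences minus the dilution
    # caused by the extra words the phrases bring along.
    gain_per_repeat = phrase_word_count - target_prob * phrase_words
    return math.ceil(deficit / gain_per_repeat)
```

For instance, with a base corpus of 1,000,000 words containing the word 100 times, a target probability of 0.001, and a 5-word sample phrase that contains the word once, `repetitions_needed(0.001, 1_000_000, 100, 5, 1)` gives the number of times to append that phrase.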
Briefly, it looks like a lengthy process, and I wanted to know if there is an alternative. Did you try getting it to recognize “zoom in”? For me it gives “some in”.
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
Have you tried the new word boost feature of the API?
Looks good, so that would allow me to move “zoom” above “some” in the selection process? I only started today and haven’t yet found the guidance for using the C++ or C bindings with MS Visual Studio, so I’m using Python. I’ll call addHotWord(word, boost) after creating the model, but there is no info on the boost float: is 1.0 normal, and what is the max?
Is there any info on using addHotWord and its float values? An overall explanation? Can it be used to disable words? I tried to add “node” but all it gives is “note”. Setting “note” to .1 doesn’t help…
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
No, you have to perform your own experiments for your own application.
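A minimal sketch of what such an experiment might look like in Python: transcribe the same clip once per candidate boost value and compare the results. It assumes the DeepSpeech 0.9 Python package, where `Model` exposes `enableExternalScorer`, `addHotWord`, `clearHotWords` and `stt`. As far as I understand the API docs, a positive boost makes the word more likely and a negative one less likely, but that is worth checking for your version. The helper names and the file paths are hypothetical.

```python
def sweep_hot_word_boost(model, audio, word, boosts):
    """Transcribe `audio` once per boost value and return {boost: transcript}."""
    results = {}
    for boost in boosts:
        model.clearHotWords()          # start every run from a clean state
        model.addHotWord(word, boost)  # boost is a plain float
        results[boost] = model.stt(audio)
    model.clearHotWords()              # leave the model as we found it
    return results


def load_model(model_path, scorer_path):
    """Create a model with an external scorer (hot-words act on the scorer)."""
    # Imported lazily so the sweep above can be tried without DeepSpeech installed.
    from deepspeech import Model
    model = Model(model_path)
    model.enableExternalScorer(scorer_path)
    return model
```

Usage would be along the lines of `sweep_hot_word_boost(load_model("model.pbmm", "model.scorer"), audio, "zoom", [-5.0, 0.0, 5.0, 10.0])`, where `audio` is the 16-bit, 16 kHz mono buffer you already pass to `stt`.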
Okay, I seem to have found useful values now. DeepSpeech seems really excellent, hope you can get a team together to take it forward after all the layoffs.
Hi again, I’m finding that non-typical names, such as lunar craters, are not recognized even when boosted with addHotWord. I searched for “adding those words into the existing language model, as a baseline” but there are 0 references on the search engines. I don’t have the time to go through the source code and unravel DeepSpeech, so I would like to ask if you or anyone knows of a clear guide to “adding custom words to an existing model”. Thanks…
From my understanding hotwords will only work for existing vocabulary from the scorer. You will most likely have to train your own scorer (pretty much a language model plus vocab and alphabet).
The documentation is actually pretty straightforward and can be found here:
So if you want to keep the general big corpus for open transcription but also add your custom words, you will have to create a custom corpus of sentences for your special vocabulary, merge it with the original corpus, and then recreate the scorer from the merged corpus.
If you only need transcription to work for a small, domain-specific set of words and sentences, you can create a scorer from just the sentences and vocabulary you actually need.
I actually do this, but I don’t use the Python script provided. I use KenLM directly to create the LM and trie, and then use the generate scorer binary.
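The corpus-merging step described above could be sketched roughly like this in Python. It only prepares the merged text file; the scorer itself would still be built with the tooling from the documentation. The carrier templates, the function names, and the assumption that the alphabet is lowercase a-z plus apostrophe and space are all mine, so adjust them to your own alphabet file (names with accented characters would need more care than this).

```python
import re


def normalise(text):
    """Lowercase and keep only a-z, apostrophes and single spaces."""
    text = re.sub(r"[^a-z' ]+", " ", text.lower())
    return " ".join(text.split())


def build_custom_sentences(words, templates):
    """Put each custom word into every carrier template, one sentence each."""
    sentences = []
    for word in words:
        clean = normalise(word)
        if not clean:
            continue  # nothing left after normalisation
        for template in templates:
            sentences.append(normalise(template.format(word=clean)))
    return sentences


def merge_corpus(base_path, custom_sentences, out_path, repeats=1):
    """Write the base corpus followed by the custom sentences, one per line.

    `repeats` controls how often the custom block is appended, which is
    the knob for raising the custom words' relative frequency.
    """
    with open(base_path, encoding="utf-8") as base, \
            open(out_path, "w", encoding="utf-8") as out:
        for line in base:
            line = line.strip()
            if line:
                out.write(line + "\n")
        for _ in range(repeats):
            for sentence in custom_sentences:
                out.write(sentence + "\n")
```

For example, `build_custom_sentences(["Mozilla", "IBM"], ["open {word}", "{word}"])` yields four short sentences, and `merge_corpus` then appends them to a copy of the original corpus text.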
Thanks for that, but being totally new to DS that sounds like ancient Egyptian and a deep learning curve. Our app needs the full language model, plus all crater names, so students can ask to fly to any listed crater. I have a text file of all 2500 craters, but no idea how to add it.
The link you posted assumes the reader knows the input format for all the mentioned tools, so it raises as many questions as it answers.
Given that many new users will want to add words, streamlining this process, or at least giving a clear YouTube video will be critical to capturing new users.
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
Unless people clearly state what they need, it is also hard for us to know what is complicated and what is not.
It’s down to a text file, and a few commands to run.