So I’d like to extend the language model with custom words that do not appear in the default one, or appear only rarely - e.g. imagine “mozilla”, “google”, “ibm” appear in many of the samples I’d like to transcribe.
The question is how much text I’d need to add on top of the texts used for the DeepSpeech default language model.
I thought the following method might work, but I’d like to see if someone has a better way of doing that:
- Collect a sample text containing the custom words and estimate the probability of each custom word in that sample, e.g.
sample_mozilla_probability = #mozilla / #sample_words
- I would like the final LM training set (final = DeepSpeech text + new text) to have the same probability for my custom words as my sample text does, so I calculate the number of occurrences needed in the new text as
mozilla_needed_count = sample_mozilla_probability * #words_deepspeech_text
- So to get my final language model training set, I can take my sample phrases containing mozilla, repeat them enough times to reach mozilla_needed_count, and append them to the original DeepSpeech text.
I ignore the fact that the word mozilla may already appear in the DeepSpeech text, but accounting for that should be as easy as counting #mozilla in the DeepSpeech text and subtracting it from mozilla_needed_count.
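The arithmetic above can be sketched in a few lines of Python. This is just a minimal illustration of the counting, not anything from DeepSpeech itself; the function name, the toy corpus sizes, and the filler sample are all made up:

```python
from collections import Counter

def extra_occurrences_needed(sample_words, base_word_count, custom_word,
                             base_count_of_custom=0):
    """How many extra occurrences of `custom_word` to append so that its
    frequency in the combined corpus matches its frequency in the sample.

    sample_words         -- list of tokens from the custom sample text
    base_word_count      -- total word count of the base (DeepSpeech) text
    base_count_of_custom -- occurrences of the word already in the base text
    """
    sample_counts = Counter(sample_words)
    # sample_mozilla_probability = #mozilla / #sample_words
    sample_prob = sample_counts[custom_word] / len(sample_words)
    # mozilla_needed_count = sample_mozilla_probability * #words_deepspeech_text
    needed = round(sample_prob * base_word_count)
    # subtract occurrences already present in the base text
    return max(needed - base_count_of_custom, 0)

# toy example: "mozilla" occurs 2 times in a 100-word sample (prob 0.02);
# the base corpus has 1,000,000 words and already contains 50 "mozilla"
sample = ["mozilla"] * 2 + ["filler"] * 98
print(extra_occurrences_needed(sample, 1_000_000, "mozilla",
                               base_count_of_custom=50))  # 19950
```

One caveat worth keeping in mind: appending the repeated phrases also grows the total word count of the final corpus, so the resulting probability will come out slightly below the target unless that is factored in.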
Does that sound reasonable?