Custom LM causes terrible false positive rate

Hello!

We have a specific use case where the DeepSpeech tflite model will be used on an Android device and needs to recognize about 30 commands. I successfully created an LM binary and trie file using the tools in the repo and KenLM. This decreased our WER by a lot, but I am noticing some funky behavior when I pass the model audio containing a sentence of words that are out of vocabulary (OOV). Instead of ignoring those words and treating them as noise, as the restricted vocab would suggest, it tries to force the audio into one of the 30 command buckets, causing a false positive.

Is there a way to retrieve a confidence score, or to make the model more robust to input like this? Or is there something I could try when generating the LM and trie?

Any help will be greatly appreciated!

Commands used:

./lmplz -o 3 < corpus.txt > lm.arpa --discount_fallback
./build_binary lm.arpa lm.binary
./generate_trie ../alphabet.txt lm.binary trie

Thanks!

Maybe there is some optimization when creating the LM file that allows more emphasis on <unk>? I thought --interpolate_unigrams 0 would help with that, but I saw no difference!

Those commands do not match what we document for producing the language model. Can you verify after using the proper ones?

Yes, have a look at the Metadata part in the API
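
For example, something along these lines should surface the decoder's confidence score (this assumes the 0.6-era Python API, where the constructor takes a beam width and enableDecoderWithLM takes the LM, trie, alpha and beta; signatures differ across releases, and the file names here are placeholders):

import wave
import numpy as np
import deepspeech

model = deepspeech.Model("output_graph.tflite", 500)        # model path, beam width
model.enableDecoderWithLM("lm.binary", "trie", 0.75, 1.85)  # lm, trie, alpha, beta

with wave.open("clip.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

meta = model.sttWithMetadata(audio)
text = "".join(item.character for item in meta.items)
print(text, meta.confidence)  # threshold on confidence to reject doubtful utterances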

This is something we have already experimented with successfully, although with a few more commands.

I’d really like to see the outcome with proper LM generation parameters.

Is what I used not correct? Any tips on that?

I am not sure where that is documented. I see this file here: https://github.com/mozilla/DeepSpeech/blob/d925e6b5fc186f3524e7c03d6eacf440d5366262/data/lm/generate_lm.py

But that includes pruning and a huge dataset. We have a small number of command phrases where some phrases are 1 word long, so we do not want to filter or prune.

I see that I changed the order from what is listed here: https://github.com/mozilla/DeepSpeech/blob/ea8e4637d34fcdbd2d0d77d821208e2e1012a59c/native_client/kenlm/README.md#estimation

I will try with an order of five and see what happens.

Also, for the smaller corpus, I am getting an error unless I use the discount fallback flag. Does this flag change the nature of the solution?

Yes, but look at the build_binary call: there is some quantization and a trie data structure specified.
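
Roughly, the documented pipeline looks like the sketch below (mirroring data/lm/generate_lm.py but without the LibriSpeech-scale pruning, and keeping the --discount_fallback workaround you need for a tiny corpus); double-check the exact -a / -q / trie options against the script in the repo:

import subprocess

# Build a 5-gram ARPA model; --discount_fallback is only needed because the
# command corpus is too small for KenLM's default discount estimation.
subprocess.check_call(
    "lmplz --order 5 --text corpus.txt --arpa lm.arpa --discount_fallback",
    shell=True)

# Quantize probabilities (-q 8), compress pointers (-a 255) and write the
# "trie" binary format that the DeepSpeech decoder expects.
subprocess.check_call(
    "build_binary -a 255 -q 8 trie lm.arpa lm.binary",
    shell=True)

# Build the decoder trie from the alphabet and the binary LM.
subprocess.check_call(
    "generate_trie alphabet.txt lm.binary trie",
    shell=True)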

I can’t really speak to the behavior of the flag, but I have hit the same limitation as well, although it did not seem to have the same impact as what you describe.

So I did it again with the updated build_binary command and rebuilt the trie and lm.binary files. This language model is focused mainly on numbers along with some other commands. As a false positive smoke test, I pass the model samples from LibriSpeech just to see what comes out. Instead of ignoring the OOV words, it tries to force them into one of the buckets.

e.g. these are transcripts that were produced as false positives:
two nine one two 
scan one one ten one
five ten four seven
three one

This part of the error is interesting when --discount_fallback is omitted:

To override this error for e.g. a class-based model, rerun with --discount_fallback

Any idea what a class-based model is?

Did you tune the LM hyperparameters alpha and beta?

I have not. I see in the repo these comments:

# The alpha hyperparameter of the CTC decoder. Language Model weight
LM_ALPHA = 0.75

# The beta hyperparameter of the CTC decoder. Word insertion bonus.
LM_BETA = 1.85

Do you have a recommendation of which would be best to focus on?

Both. Do a grid search, or a random search, and plot the error surface to see in what direction things are improving.
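
For instance, a minimal sweep could look like the sketch below. evaluate_wer is a hypothetical helper standing in for however you score a decoding run (e.g. wrapping evaluate.py or the native client on a held-out set of command and non-command clips):

import itertools
import numpy as np

def evaluate_wer(alpha, beta):
    # Hypothetical: re-enable the decoder with these values and return
    # (WER, false_positive_rate) on the held-out clips.
    raise NotImplementedError

alphas = np.linspace(0.0, 3.0, 13)  # coarse grid; refine around the best cell
betas = np.linspace(0.0, 3.0, 13)

results = [(a, b, *evaluate_wer(a, b)) for a, b in itertools.product(alphas, betas)]
for a, b, wer, fp in sorted(results, key=lambda r: (r[2], r[3]))[:10]:
    print(f"lm_alpha={a:.2f} lm_beta={b:.2f} WER={wer:.3f} FP={fp:.3f}")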

That is running now. Will post results when completed. Thanks!

~Update~
Some preliminary results show that to jointly optimize for model execution time, a low false positive rate, and the lowest WER using a genetic algorithm (GA), lm_alpha needs to be greater than lm_beta and beam_width needs to be less than 50. Going to let the GA run for a bit and try to get some hard numbers for my use case.

Is it worth maybe looking into tuning the full model on small set of Librispeech samples but modify all words to for a few epochs and then train on small set of samples from our own audio that uses the commands? Not sure if there is much more we can optimize for on the LM side.

I don’t understand your question.

small set of Librispeech samples but modify all words to for a few epochs

What does this mean?

and then train on small set of samples from our own audio that uses the commands

You can also try fine-tuning the model on just these samples. If it’s very few samples (<100) you can even do it on a laptop. Other people who reported fine-tuning experiments here on the forum recommended using a much lower learning rate when doing that. We use 1e-4 for our models; maybe try 1e-6 and 1e-7 and see which one works best.
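
Something like this, roughly (flag names are from the 0.6-era DeepSpeech.py, so check util/flags.py in your checkout, and point --checkpoint_dir at a copy of the released checkpoint so you don't overwrite it):

import subprocess

subprocess.check_call([
    "python", "DeepSpeech.py",
    "--n_hidden", "2048",                         # must match the released model
    "--checkpoint_dir", "fine_tune_checkpoint/",  # copy of the released checkpoint
    "--train_files", "train.csv",
    "--dev_files", "dev.csv",
    "--test_files", "test.csv",
    "--learning_rate", "0.000001",                # 1e-6, per the advice above
    "--epochs", "3",
])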

I am sorry. I forgot a word.
What if I modified the LibriSpeech dataset so that all words are mapped to <unk> for around 100 samples, and then trained on my training set?

And good to know. We may try that then. We do not have a lot of data.

So the key takeaway for us was that decoder post-processing becomes much more effective once the beam width is reduced. It is hard to notice a material difference from lm_alpha and lm_beta as beam_width is decreased. For our use case, a beam_width of 5 with lm_alpha and lm_beta of 0.001 lets the decoder spit out nonsense when OOV words are spoken, which we can easily filter out. Sort of hacky, but effective. Thanks for all the help!
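
In case it helps anyone else, the post-filter amounts to something like this sketch (COMMAND_WORDS is a stand-in for our ~30-command vocabulary):

COMMAND_WORDS = {"scan", "one", "two", "three", "four", "five",
                 "six", "seven", "eight", "nine", "ten"}  # example subset

def accept(transcript: str) -> bool:
    # With beam_width ~5 and near-zero alpha/beta, OOV speech decodes to tokens
    # outside the command vocabulary, so reject any transcript containing one.
    words = transcript.split()
    return bool(words) and all(w in COMMAND_WORDS for w in words)

print(accept("scan one one ten one"))   # True  -> treat as a command
print(accept("he went to the market"))  # False -> ignore as OOV speech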

Any tips for building that dataset? The audio files are all WAV files at 16-bit depth with a 16 kHz sample rate. Can I just build train, dev, and test sets of files that list the absolute path to each clip mapped to its transcript? Is there an example anywhere of this format? Thanks!
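
In case it is useful, here is a small sketch of building the CSVs in the format the importer scripts in bin/ produce, as I understand it: a header of wav_filename,wav_filesize,transcript, with absolute paths to the 16 kHz, 16-bit mono WAV clips and lower-case transcripts restricted to the alphabet:

import csv
import os

def write_manifest(pairs, out_path):
    # pairs: list of (absolute_wav_path, transcript) tuples.
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in pairs:
            writer.writerow([wav_path, os.path.getsize(wav_path), transcript.lower()])

# write_manifest(train_pairs, "train.csv")
# write_manifest(dev_pairs, "dev.csv")
# write_manifest(test_pairs, "test.csv")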