We have a specific use case where the DeepSpeech tflite model will be used on an Android device and needs to recognize about 30 commands. I successfully created an lm binary and trie file using the tools in the repo and KenLM. This decreased our WER by a lot, but I am noticing some odd behavior when I pass the model audio containing a sentence made up of OOV words. Instead of ignoring those words and treating them as noise, as the restricted vocabulary would suggest, it tries to force the audio into one of those 30 command buckets, causing false positives.
Is there a way to retrieve a confidence score, or to make the model more robust to inputs like this? Or is there something I could try with the generation of the LM and trie?
Maybe there is some optimization when creating the LM file that allows more emphasis on <unk>? I thought --interpolate_unigrams 0 would help with that, but I saw no difference!
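A quick way to sanity-check this is to inspect how the restricted LM actually scores in-vocabulary vs. OOV words with the KenLM Python bindings; this makes the effect of options like --interpolate_unigrams visible. A minimal sketch, with the file name and example sentences as placeholders:

```python
import kenlm

# Load the binary LM produced by build_binary (file name is a placeholder).
lm = kenlm.Model("lm.binary")

def inspect(sentence):
    # full_scores() yields (log10 prob, n-gram length, is_oov) per word plus </s>.
    words = sentence.split() + ["</s>"]
    print(sentence)
    for (logprob, ngram_len, oov), word in zip(lm.full_scores(sentence), words):
        flag = "  <-- OOV" if oov else ""
        print(f"  {word:>10s}  logprob={logprob:7.2f}  ngram={ngram_len}{flag}")
    print(f"  total: {lm.score(sentence):.2f}\n")

inspect("scan one one ten one")            # in-vocabulary command phrase
inspect("he hoped there would be stew")    # OOV, LibriSpeech-style sentence
```

If the OOV sentence does not score dramatically worse than the command phrase, the decoder has little incentive to reject it, which matches the behavior described above.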
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
Those commands do not match what we document for producing the language model. Can you verify after using the proper ones?
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
Yes, have a look at the Metadata part in the API
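A minimal sketch of pulling a confidence value out of the Metadata API from Python, assuming a 0.6-era client where Model takes (model path, beam width) and Metadata exposes a confidence field; exact signatures and field names shift between releases (in 0.7+ the confidence moves onto metadata.transcripts[0]), and all paths and the alpha/beta values here are placeholders:

```python
import wave
import numpy as np
from deepspeech import Model

# Placeholders: model, LM, and trie paths, plus the alpha/beta values.
ds = Model("output_graph.pbmm", 500)                     # (model path, beam width)
ds.enableDecoderWithLM("lm.binary", "trie", 0.75, 1.85)  # (lm, trie, alpha, beta)

with wave.open("sample.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

meta = ds.sttWithMetadata(audio)
text = "".join(item.character for item in meta.items)
print(f"transcript: {text!r}  confidence: {meta.confidence:.2f}")
```

The same lm.binary, trie, and alpha/beta values carry over to the Android tflite client; only the binding differs.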
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
This is something we have already experimented with successfully, although with a few more commands.
I’d really like to see the outcome with proper LM generation parameters.
But that includes pruning and a huge dataset. We have a small number of command phrases, some of which are only one word long, so we do not want to filter or prune.
Also, for the smaller corpus, I am getting an error unless I use the --discount_fallback flag. Does this flag change the nature of the solution?
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
Yes, but look at the build_binary call, there’s some quantization and trie format specified.
I can’t really speak to the behavior of the flag, but I have hit the same limitation as well, although it did not seem to have the same impact as what you describe.
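For reference, a sketch of the full LM + trie build, roughly in the shape the repo documents but without pruning (which only makes sense for a large corpus). The n-gram order, quantization values, and file names below are assumptions, so check data/lm in the DeepSpeech repo for the exact parameters used for the release model:

```python
import subprocess

# One command phrase per line, lowercased; all names/values are placeholders.
CORPUS = "commands.txt"

# 1. Estimate the ARPA model. --discount_fallback is what gets a tiny corpus
#    past the discount estimation error; no --prune, since every phrase matters.
subprocess.run(
    ["lmplz", "--order", "4", "--discount_fallback",
     "--text", CORPUS, "--arpa", "commands.arpa"],
    check=True,
)

# 2. Convert to a quantized, trie-format binary (the "quantization and trie
#    format" in the documented build_binary call; exact -a/-q values may differ).
subprocess.run(
    ["build_binary", "-a", "255", "-q", "8", "trie",
     "commands.arpa", "lm.binary"],
    check=True,
)

# 3. Build the decoder trie with the native_client tool, using the same
#    alphabet.txt as the acoustic model.
subprocess.run(
    ["generate_trie", "alphabet.txt", "lm.binary", "trie"],
    check=True,
)
```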
So I did it again with the updated build_binary command and rebuilt the trie and lm.binary files. This language model is focused mainly on numbers, along with some other commands. As a false-positive smoke test, I pass the model samples from LibriSpeech just to see what comes out. Instead of ignoring the OOV words, it tries to force them into one of the buckets.
e.g., these are transcriptions that come out as false positives:
two nine one two
scan one one ten one
five ten four seven
three one
This part of the error is interesting when --discount_fallback is omitted:
To override this error for e.g. a class-based model, rerun with --discount_fallback
# The alpha hyperparameter of the CTC decoder. Language Model weight
LM_ALPHA = 0.75
# The beta hyperparameter of the CTC decoder. Word insertion bonus.
LM_BETA = 1.85
Do you have a recommendation of which would be best to focus on?
Update:
Some preliminary results show that, when optimizing for model execution time, a low false positive rate, and the lowest WER using a genetic algorithm (GA), lm_alpha needs to be greater than lm_beta and beam_width needs to be less than 50. I am going to let my GA run for a while and try to get some hard numbers for my use case.
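For anyone reproducing this, here is a stripped-down exhaustive sweep (in place of the GA) over beam_width, lm_alpha, and lm_beta, assuming the 0.6-era Python client described earlier; the evaluation pairs, the WER-only objective, and the parameter grids are placeholders to adapt:

```python
import itertools
import wave
import numpy as np
from deepspeech import Model

def read_wav(path):
    with wave.open(path, "rb") as w:
        return np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

def wer(ref, hyp):
    """Word error rate via standard edit distance."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=np.int32)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return d[len(r), len(h)] / max(len(r), 1)

# Placeholder evaluation pairs: (wav path, reference transcript).
EVAL = [("scan_one.wav", "scan one one ten one"),
        ("two_nine.wav", "two nine one two")]

best = None
for beam_width, lm_alpha, lm_beta in itertools.product(
        [5, 10, 25, 50], [0.001, 0.75, 1.5], [0.001, 1.85]):
    ds = Model("output_graph.pbmm", beam_width)
    ds.enableDecoderWithLM("lm.binary", "trie", lm_alpha, lm_beta)
    avg = sum(wer(ref, ds.stt(read_wav(p))) for p, ref in EVAL) / len(EVAL)
    if best is None or avg < best[0]:
        best = (avg, beam_width, lm_alpha, lm_beta)

print("best (avg WER, beam_width, lm_alpha, lm_beta):", best)
```

A combined objective (WER plus false-positive rate on OOV clips) would be closer to the GA setup; the loop structure stays the same.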
Is it worth maybe looking into tuning the full model on small set of Librispeech samples but modify all words to for a few epochs and then train on small set of samples from our own audio that uses the commands? Not sure if there is much more we can optimize for on the LM side.
small set of Librispeech samples but modify all words to for a few epochs
What does this mean?
and then train on small set of samples from our own audio that uses the commands
You can also try fine tuning the model on just these samples. If it’s very few samples (<100) you can even do it on a laptop. Other people who reported fine tuning experiments here on the forum recommended using a much lower learning rate when doing that. We use 1e-4 for our models, maybe try 1e-6 and 1e-7, see which one works best.
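A sketch of what such a fine-tuning run might look like, assuming a 0.6-era checkout of the DeepSpeech repo with a matching release checkpoint extracted locally; flag names vary between releases, and all paths, the epoch count, and the learning rate below are placeholders:

```python
import subprocess

# checkpoints/ is assumed to hold the extracted release checkpoint matching
# this DeepSpeech version; train/dev/test CSVs are in the importer format.
subprocess.run(
    ["python3", "DeepSpeech.py",
     "--checkpoint_dir", "checkpoints/",
     "--train_files", "train.csv",
     "--dev_files", "dev.csv",
     "--test_files", "test.csv",
     "--learning_rate", "1e-6",   # also try 1e-7, as suggested above
     "--epochs", "3",
     "--export_dir", "exported/"],
    check=True,
)
```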
I am sorry. I forgot a word.
What if I modified the LibriSpeech dataset so that all words are mapped to <unk> for, say, 100 samples, and then trained on my own training set?
And good to know. We may try that then. We do not have a lot of data.
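If you do try the remapping idea above, one practical caveat: the acoustic model is character-based, so whatever placeholder you map words to has to be spelled with characters that exist in alphabet.txt (a literal "<unk>" will not work, since < and > are not in the alphabet). A sketch of rewriting a DeepSpeech-format CSV this way, with file names as placeholders:

```python
import csv

# Replace every word in the transcripts of a DeepSpeech-format CSV
# (wav_filename, wav_filesize, transcript) with a single placeholder token.
PLACEHOLDER = "unk"   # spelled with alphabet.txt characters only

with open("librispeech-subset.csv", newline="") as src, \
     open("librispeech-unk.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        n_words = len(row["transcript"].split())
        row["transcript"] = " ".join([PLACEHOLDER] * n_words)
        writer.writerow(row)
```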
So the key takeaway for us was that decoder post-processing becomes much more effective once the beam width is reduced. It is hard to notice a material difference from lm_alpha and lm_beta as beam_width is decreased. For our use case, a beam_width of 5 with lm_alpha and lm_beta of 0.001 lets the decoder spit out nonsense when OOV words are spoken, which we can easily filter out. Sort of hacky, but effective. Thanks for all the help!
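A sketch of that post-processing filter: reject any transcript containing a word outside the command vocabulary (the word list here is a made-up stand-in for the real ~30 commands):

```python
# COMMAND_WORDS is a made-up stand-in for the real ~30-command vocabulary.
COMMAND_WORDS = {"scan", "one", "two", "three", "four", "five",
                 "seven", "nine", "ten"}

def accept(transcript: str) -> bool:
    """Accept a decoded transcript only if every word is a known command word."""
    words = transcript.split()
    return bool(words) and all(w in COMMAND_WORDS for w in words)

print(accept("scan one one ten one"))                      # True
print(accept("he hoped there would be stew for dinner"))   # False
```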
Any tips for building that dataset? The audio files are all WAV files at 16-bit depth with a 16 kHz sample rate. Can I just build train, dev, and validation sets of files that list the absolute path to each clip mapped to its transcript? Is there an example anywhere of this format? Thanks!
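In case it helps: the importers in the repo produce CSVs with a wav_filename, wav_filesize, transcript header, one row per clip, and absolute paths work fine. A sketch of building train/dev/test splits in that format, assuming (hypothetically) a tab-separated transcript listing; adapt the lookup to however your transcripts are actually stored:

```python
import csv
import os
import random

# transcripts.tsv is a hypothetical "clip-name<TAB>transcript" listing.
CLIP_DIR = "clips"
with open("transcripts.tsv") as f:
    transcripts = dict(line.rstrip("\n").split("\t", 1)
                       for line in f if line.strip())

rows = []
for name, text in transcripts.items():
    path = os.path.abspath(os.path.join(CLIP_DIR, name))
    rows.append({"wav_filename": path,
                 "wav_filesize": os.path.getsize(path),
                 "transcript": text.lower()})

random.shuffle(rows)
n_dev = n_test = max(1, len(rows) // 10)
splits = {"dev.csv": rows[:n_dev],
          "test.csv": rows[n_dev:n_dev + n_test],
          "train.csv": rows[n_dev + n_test:]}

for filename, split_rows in splits.items():
    with open(filename, "w", newline="") as out:
        writer = csv.DictWriter(
            out, fieldnames=["wav_filename", "wav_filesize", "transcript"])
        writer.writeheader()
        writer.writerows(split_rows)
```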