I am using DeepSpeech 0.5.1. My use case is to recognise a few voice commands (mainly digits and a few other words). I created a custom LM for my commands. Here is what I did for it:
vocabulary.txt (containing the commands to be recognised):
one
two
three
four
five
six
seven
eight
nine
yes
no
tell me options
need help
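For reference, a minimal sketch of the usual 0.5.x LM build from a file like this, assuming the standard KenLM lmplz/build_binary tools and the generate_trie binary shipped with native_client (paths and flags here are illustrative, not confirmed from the post):

```python
import subprocess

# 1. Estimate an ARPA language model from the command corpus with KenLM.
#    With a corpus this small, --discount_fallback is typically needed,
#    otherwise lmplz can abort while estimating discounts (assumption).
subprocess.run(
    ["lmplz", "--order", "5", "--discount_fallback",
     "--text", "vocabulary.txt", "--arpa", "words.arpa"],
    check=True,
)

# 2. Convert the ARPA file into KenLM's binary format.
subprocess.run(["build_binary", "words.arpa", "lm.binary"], check=True)

# 3. Build the trie used by the 0.5.x CTC decoder.
subprocess.run(["generate_trie", "alphabet.txt", "lm.binary", "trie"], check=True)
```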
It’s working reasonably well: it recognises the single digits and the other sentences in vocabulary.txt. However, it does not work well when I speak multiple digits together, e.g. “four nine seven” … It misses one or more words and mostly gives only a single word as output (though sometimes multiple words do come through).
Am I doing something wrong? How can I improve the results?
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
2
Please share more context on how you evaluate performance.
I use the “mic_vad_streaming” script and speak through the mic…
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
4
So you are adding a lot of variability in how the sound is processed and hurting reproducibility. Please record clean, slow-paced audio into a WAV file to perform reproducible comparisons.
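For such a comparison, something along the lines of the stock Python client should work, assuming the 0.5.1 API and a 16 kHz, 16-bit mono WAV (file names here are placeholders):

```python
import scipy.io.wavfile as wav
from deepspeech import Model

# Constants matching the 0.5.1 Python client defaults.
N_FEATURES, N_CONTEXT, BEAM_WIDTH = 26, 9, 500
LM_ALPHA, LM_BETA = 0.75, 1.85

ds = Model("output_graph.pbmm", N_FEATURES, N_CONTEXT, "alphabet.txt", BEAM_WIDTH)
ds.enableDecoderWithLM("alphabet.txt", "lm.binary", "trie", LM_ALPHA, LM_BETA)

# Clean, slow-paced recording of e.g. "four nine seven".
fs, audio = wav.read("four_nine_seven.wav")
print(ds.stt(audio, fs))
```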
OK, I will try with that, but even if it works, how do I make it work in the real world…
My use case is to be able to use it on Android and recognise commands spoken on the phone. The end user is expected to use a noise-cancelling headset (so I can get clean audio), but I cannot ask them to speak slowly. How do I handle the pace of speech and make it work at a usual speaking pace?
I may be wrong here, but isn’t it because the language model order parameter is set to 5 (so it is trying to model word sequences of length 5), while the actual dataset mostly has one-word sentences?
This brings me to a question: how would you set the order parameter lower than 3? I remember being unable to set the O parameter to 1 to just recognize single words.
Or does the O parameter not make a difference for a small dataset such as this?
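If the order is the culprit, lmplz does accept a lower value via --order; on a corpus this small it usually also needs --discount_fallback (both assumptions worth checking against your KenLM build), e.g.:

```python
import subprocess

# Re-estimate the ARPA model with a lower n-gram order (bigrams here).
# --discount_fallback works around discount estimation failing on tiny corpora.
subprocess.run(
    ["lmplz", "--order", "2", "--discount_fallback",
     "--text", "vocabulary.txt", "--arpa", "words.arpa"],
    check=True,
)
```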
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
7
I’m asking you to perform something reproducible to isolate issues, because we did successfully test the use case you describe on Android devices …
Also, you are testing with a user-contributed example, which can have its own flaws and add issues.
OK… I get it… just to summarize: with what I have done, multiple unigrams should be recognized and there is no extra configuration or customization needed in the LM…
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
9
Your LM is very, very small, and I also know that KenLM might have a hard time producing something that works properly, at least in this configuration. As @bilal.iqbal mentioned, order 5 with that little data might not yield a very efficient LM.
Is it possible to print the inference logs, to look further into how the LM is influencing the final results…
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
11
Yes, @reuben landed some debugging helpers, but you have to rebuild the CTC decoder. Please have a look at the DEBUG define for native_client/ctcdecoder/
We are not using 0.6.0. We are on 0.5.1.
Also, my problem was solved to an extent by adding permutations of all the digits to my vocabulary.txt. With this, the LM is picking up the combinations and the results are better.
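One way to generate such a permutation corpus, as an illustrative sketch (the exact sequence lengths used are an assumption):

```python
from itertools import product

digits = ["one", "two", "three", "four", "five",
          "six", "seven", "eight", "nine"]
phrases = ["yes", "no", "tell me options", "need help"]

with open("vocabulary.txt", "w") as f:
    for line in digits + phrases:
        f.write(line + "\n")
    # All two- and three-digit sequences, e.g. "four nine seven".
    for length in (2, 3):
        for combo in product(digits, repeat=length):
            f.write(" ".join(combo) + "\n")
```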
Now I am adding debug statements in the CTC decoder to narrow it down further.
Thanks for asking.
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
15
That’s why I said it would be great if you could confirm this is still an issue on 0.6.0.
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
16
It’d still be valuable if you could evaluate on 0.6.0.