Feature Request: Keyword Spotting

I’ve been following the impressive work done on this project. What I would like to see is the ability to detect keywords, e.g. wake words and short commands. The detectable keywords should be configurable, and the STT engine should constrain its results to match ONLY those keywords (everything looks like a nail when you’re a hammer). In other words, if the configured keyword were “house”, even recognized words such as “mouse” could be matched, up to a certain (maybe even configurable) threshold.
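To make the matching part concrete, here is a rough sketch (plain Python, nothing DeepSpeech-specific) of the kind of thresholded keyword matching I have in mind; the keyword list, the threshold value and the helper name are only illustrative:

```python
# Hypothetical post-processing step: snap a recognized word to the closest
# configured keyword if it is similar enough (e.g. "mouse" -> "house").
from difflib import SequenceMatcher

KEYWORDS = ["house", "lights", "garage"]  # configurable keyword list (example)
THRESHOLD = 0.8                           # configurable similarity threshold

def match_keyword(recognized_word, keywords=KEYWORDS, threshold=THRESHOLD):
    """Return the best-matching keyword, or None if nothing is close enough."""
    best, best_score = None, 0.0
    for kw in keywords:
        score = SequenceMatcher(None, recognized_word.lower(), kw).ratio()
        if score > best_score:
            best, best_score = kw, score
    return best if best_score >= threshold else None

print(match_keyword("mouse"))   # -> "house" (ratio 0.8)
print(match_keyword("banana"))  # -> None
```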

Thanks for your consideration.

Have you tried a dedicated language model containing only those keywords? We did some experiments with that, in several contexts, and it worked quite well with the released English model.
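For reference, restricting the decoder at runtime looks roughly like this with the Python bindings. The scorer file would be built beforehand from a text file containing only the command words (via KenLM and the external-scorer packaging tool shipped with DeepSpeech); the file names and the alpha/beta values below are just placeholders:

```python
import wave
import numpy as np
import deepspeech

# Placeholder file names: the acoustic model is the released one, the scorer
# is a small external scorer built only from the command words.
MODEL_PATH = "deepspeech-0.7.0-models.pbmm"
KEYWORD_SCORER = "keywords.scorer"

model = deepspeech.Model(MODEL_PATH)
model.enableExternalScorer(KEYWORD_SCORER)
# Optionally lean harder on the (tiny) language model; these values are only
# illustrative, not tuned.
# model.setScorerAlphaBeta(1.5, 1.0)

with wave.open("command.wav", "rb") as w:  # 16 kHz, 16-bit mono input
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

# With a keyword-only scorer the decoder strongly favours the command words.
print(model.stt(audio))
```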

@lissyx one of the key requirements for wake word or keyword spotting is that the solution should be always listening, so it has to be computationally efficient, with a reduced footprint in terms of flash and RAM usage. Can a DeepSpeech acoustic model and a language model trained only to recognize the keywords result in a low footprint?

Take a look at https://github.com/MycroftAI/mycroft-precise

Well … a dedicated language model might work, but the idea is to change the keywords “on the fly”, depending on the environment, e.g. in a home automation system where the operator installs and names a new switch.

@lissyx, @reuben Coming back to this topic, I was trying to experiment with the DeepSpeech model for wake word detection by reducing the number of hidden units from 2048 to 256 during training, using a custom dataset with samples from various speakers for 10 short words (max of 10 characters), with approximately 1.5K samples per word.

Though the CPU load for inference was reduced 8x compared to the original DeepSpeech model, this was still not efficient enough for a keyword system. Do you have any suggestions on modifying the training parameters, such as further reducing the number of hidden units, to create an even smaller network that only needs to predict 10 to 20 characters?

That seems quite radical. I could build something quite efficient on Android with the default English model and just a dedicated language model with command words. Could you document exactly what your constraints are?

@lissyx the specific concern with using the pre-trained model as-is is the high CPU usage: since this model is expected to be always active, having to decipher every word being spoken, running it with heavy CPU usage is not affordable.

For example, I was running the TFLite version of the pre-trained model on Qualcomm 820 hardware to infer speech continuously, and it takes almost 100% of the CPU (which is close to 2.0 GHz of processing power).

This is why I was trying to reduce n_hidden: it reduces the complexity of the model, and I was thinking that would be efficient enough to decode a single word. Do you have any suggestions on reducing the model size for a keyword detection process to make it more CPU efficient?

You may want to take a look at https://www.tensorflow.org/tutorials/sequences/audio_recognition

It should take no more than one core, though. That is still high if you run continuous recognition, but if you use VAD and the streaming API, you won’t be at 100% of one core 100% of the time.
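A rough sketch of what that looks like with the 0.7 Python bindings plus the webrtcvad package, assuming 16 kHz / 16-bit mono frames; the file names are placeholders and capture_frames() is just a stand-in for whatever audio source you actually use:

```python
import wave
import numpy as np
import webrtcvad
import deepspeech

SAMPLE_RATE = 16000
FRAME_MS = 30                                     # webrtcvad accepts 10/20/30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono

def capture_frames(wav_path, frame_bytes):
    """Yield fixed-size frames from a 16 kHz / 16-bit mono WAV (mic stand-in)."""
    with wave.open(wav_path, "rb") as w:
        while True:
            frame = w.readframes(frame_bytes // 2)
            if len(frame) < frame_bytes:
                return
            yield frame

model = deepspeech.Model("deepspeech-0.7.0-models.pbmm")  # placeholder path
vad = webrtcvad.Vad(2)                                    # aggressiveness 0-3

stream = None
for frame in capture_frames("input.wav", FRAME_BYTES):
    if vad.is_speech(frame, SAMPLE_RATE):
        if stream is None:
            stream = model.createStream()     # only decode while speech is present
        stream.feedAudioContent(np.frombuffer(frame, dtype=np.int16))
    elif stream is not None:
        print(stream.finishStream())          # speech segment ended: decode, reset
        stream = None
```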

No, because we are not working on that, so we can’t give more insight than “adapt and test”. You may want to check what @elpimous_robot did with his robot, though that was very early DeepSpeech and the model was a bit different.

And FYI, this is exactly the setup in mozillaspeechlibrary; I have a PR open against that Android component to add DeepSpeech there.

My observation is that the STT industry (particularly commercial offerings) seems to have split into short utterance-based recognition (< 30 s, typically 2-5 s) and arbitrary-length dictation/transcription. They’re being solved in different ways, and long-form is much harder.

For example, see picovoice, which its author says is ultra high performance, embeddable, and offline-able; it could be a nice complement to DeepSpeech since it is optimised for short utterances (less than 30 s).

Yeah, I think some of the efficient KWS systems use an ‘is keyword’ / ‘not keyword’ split, and that helps with model size and accuracy.

Google ran the “Visual Wake Words Challenge”, soliciting submissions of tiny vision models for microcontrollers, which https://github.com/mit-han-lab/VWW won. Even though that is a vision task, with spectrogram/MFCC (mel filterbank) preprocessing it doesn’t really matter whether the input is image or voice.

I noticed another article, https://blog.aspiresys.pl/technology/building-jarvis-nlp-hot-word-detection/, that also blurs the line between voice and image by using spectrogram/MFCC (mel filterbank) preprocessing.

It seemed to be built with an extremely lightweight audio lib and is also split into ‘is keyword’ / ‘not keyword’, with the following suggested starting dataset:

  • 200 positive samples recorded over varying degrees of background noise.
  • 100 positive samples recorded over silence.
  • 200 negative samples of random words recorded over varying degrees of background noise.
  • 100 negative samples recorded over silence.

That’s described as just a starting dataset, but from what I have seen it is very small and should make for a tiny model.

From playing with a Pi 4, the standard DeepSpeech model (default download and install) seems to produce less load than what I have seen from the Mycroft Precise KWS.

So I thought I would ask: is there any way with DeepSpeech to create that kind of is/not-keyword model, which seems to help keep models small?
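To clarify what I mean by an is/not model, something like the following tiny sketch: MFCCs of a ~1 s clip feeding a small network with a single sigmoid output. It is independent of DeepSpeech, and the shapes, layer sizes and paths are only indicative:

```python
import numpy as np
import librosa
import tensorflow as tf

SR = 16000
N_MFCC = 13

def clip_to_mfcc(path):
    """Turn a ~1 s clip into a small MFCC 'image' of shape (32, 13)."""
    y, _ = librosa.load(path, sr=SR, duration=1.0)
    y = np.pad(y, (0, max(0, SR - len(y))))  # pad short clips to exactly 1 s
    return librosa.feature.mfcc(y=y, sr=SR, n_mfcc=N_MFCC).T

# X: (num_clips, 32, 13) MFCC blocks, y: 1 = keyword, 0 = not keyword
# X = np.stack([clip_to_mfcc(p) for p in wav_paths]); y = np.array(labels)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, N_MFCC)),
    tf.keras.layers.Conv1D(16, 5, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(clip contains the keyword)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()  # on the order of a thousand parameters, vs tens of millions for DeepSpeech
# model.fit(X, y, epochs=20, validation_split=0.2)
```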

It would also be really great to get an accompanying KWS for DeepSpeech. It doesn’t need to be amazingly accurate, just something open source using pretty much the same lib set, more of a mock-up of how a KWS and DeepSpeech should interact.

But maybe what’s mentioned above could form a basic KWS for DeepSpeech?
Apologies for the necro, but a demo KWS would be great.

The current approach being investigated for integrating KWS into DeepSpeech is using https://arxiv.org/abs/1611.09405

Because it’s a decoder extension over a trained CTC model, it requires no further training, just the code to be written and integrated into the API. Further improvements are possible, for example fine-tuning keyword detection to the way a specific person says it via an audio recording, but first we want to figure out how to add just the KWS-from-text part.
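To give a rough idea of the core ingredient, here is a minimal NumPy sketch of scoring a keyword’s character sequence against the per-frame CTC posteriors the acoustic model already produces. The paper adds windowing, thresholds and more on top of this, and the names below are mine, not the eventual API:

```python
import numpy as np

def ctc_keyword_logprob(log_probs, keyword_ids, blank=0):
    """CTC forward score: log P(keyword | frames) from per-frame log posteriors.

    log_probs:   (T, C) log-softmax output of the acoustic model.
    keyword_ids: character indices of the keyword under the model's alphabet.
    """
    T, _ = log_probs.shape
    # Interleave blanks, e.g. [h, i] -> [_, h, _, i, _]
    ext = [blank]
    for k in keyword_ids:
        ext += [k, blank]
    S = len(ext)

    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            paths = [alpha[t - 1, s]]
            if s > 0:
                paths.append(alpha[t - 1, s - 1])
            # Skipping a blank is only allowed between two different labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                paths.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(paths) + log_probs[t, ext[s]]

    # Valid CTC paths end on the last label or on the trailing blank.
    return np.logaddexp(alpha[-1, -1], alpha[-1, -2])
```

In an actual KWS you would compare this score against a background model or threshold over sliding windows of the audio stream, which is where most of the remaining work described in the paper lives.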

@reuben the KWS news is great. Are there any plans to be able to turn VAD off and accept external input?

I have an RK3308 on my desk that I haven’t tried yet, so I don’t really have a clue, but it does have an embedded DSP VAD.
It’s a $15 SoC from https://wiki.radxa.com/RockpiS, and there seems to be a trend toward audio-specific SoCs with VAD that can even wake from sleep on a single mic (when awake it supports 4 channels).

KWS in DeepSpeech would be so cool :slight_smile: I was just playing with a Pi 4 and v0.7.0 and was very impressed with the load, so much so that the Pi 3A+ actually deserves a trial.
I struggled to get the DeepSpeech AArch64 version installed on the Cortex-A35 of the RK3308, but it should be at approximately Pi 3 level.

If you could make the software VAD configurable (internal on or off) so that it can accept external input, that would help.
Having software VAD is a great addition, but being able to use a VAD-enabled SoC without that load would also be great.
I don’t know how good the DSP algorithm is, but having the choice would be a great option.
Allwinner have their R328 (Cortex-A7), and VAD-enabled SoCs are becoming quite common.
I have been running the SpeexDSP AEC (software) on a Pi 3 quite successfully, but it is on the cusp of getting load-heavy, so the embedded DSP of the RK3308 is actually a big advantage.
I haven’t been able to check the PulseAudio WebRTC AEC, as the soundcard I have doesn’t seem to like PulseAudio (in fact it doesn’t seem to like much), and I was wondering if you guys or anyone else knows of a webrtc_audio_processing implementation for ALSA?

The KWS will be amazing and I’m itching to test it :slight_smile: