Practical tests of hot-word feature and default model's accuracy


We are students working towards our Bachelor’s degrees, and during our Software Testing classes our task was to test an open-source project. A few months ago I asked here what could use some testing. I was glad to receive a reply, and we started working. Doing our best, we analyzed the practical usage of the hot-word feature and the accuracy of DeepSpeech’s default model, so that you can judge its practical quality for many scenarios.

Full report:

deepspeech_test_report.pdf (426,0 KB)

Short summary for 250 different tagged audio files:

  • Female, unaccented, lecture speech without any proper nouns or scientific words was the most accurate. (95.6%)

  • Male voices (with the same combination of tags as above) were slightly less accurate (94.0%). However, there were more files for male speech in this combination, or the input files may simply have been a little harder for the model to understand.

  • An accent lowers accuracy heavily: about 83.6% for lecture speech and 82.5% for common speech, whereas lecture speech from unaccented speakers is above 94.6%.

  • Proper nouns reduce accuracy even further: about a 10% reduction in our data set for files that contained at least one of them. Note that with more proper nouns the difference could have been larger, since it depends on how many such words occur. Still, this shows that the difference is real and should be taken into account.

  • Female common speech with no additional tags has lower accuracy than male common speech (87.3% female, 91.4% male).

  • Scientific vocabulary may be related to a drop in accuracy: for male speakers, common speech with scientific words reached 86.4% accuracy, while the same tag combination without scientific words reached 91.4%. That is a drop of 5 percentage points.
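For readers who want to reproduce this kind of figure on their own files, here is a minimal sketch of one common way to score a transcript. It assumes accuracy is defined as 1 minus the word error rate (word-level edit distance divided by the number of reference words); the report may define accuracy differently. The function names `word_errors` and `word_accuracy` are made up for this example.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance between two transcripts."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] is the distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as 1 - WER, floored at 0. Assumes a non-empty reference."""
    n_ref = len(reference.split())
    return max(0.0, 1.0 - word_errors(reference, hypothesis) / n_ref)


# "gold" -> "god" is one substitution; "another" -> "an other" costs two edits.
reference = "the gold ring is another one"
hypothesis = "the god ring is an other one"
print(word_errors(reference, hypothesis), word_accuracy(reference, hypothesis))
```

Averaging `word_accuracy` over all files that share a tag combination would give one number per combination, comparable in spirit to the percentages above.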

As for the hot-words feature:

  • Adding a hot-word that contains a space, like “another one”, doesn’t change the behavior, probably because it never appears in the word detection mechanism and so is not modified.

  • Use hot-words if you need to detect one word and can ignore everything that comes after that word, because of the letter splitting bug. Example: “okay google”.

  • You can use a negative priority to keep words from occurring in the output, but be aware that the word may then appear in a split form (“another” -> “an other”) or as a similar-sounding word (“gold” -> “god”).

  • Using hot-words to calibrate accuracy is not ideal, but adding a very small priority could work.

  • Hot-words that do not exist, or that bear no similarity to any word in the audio, cause no change as long as the audio doesn’t include the sound of that word.

  • No software-related errors caused by adding, removing, or clearing hot-words were detected.
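The findings above can be turned into a small helper. This is only a sketch: `apply_hot_words` and `RecordingModel` are hypothetical names invented for this example, and the stand-in class exists so the snippet runs without DeepSpeech installed. The method names `clearHotWords()` and `addHotWord(word, boost)` are the ones exposed by `deepspeech.Model` in the 0.9.x Python bindings; the boost values are arbitrary.

```python
def apply_hot_words(model, boosts):
    """Reset the hot-words on `model`, then add the single-word entries from `boosts`.

    `model` only needs clearHotWords() and addHotWord(word, boost).
    Returns the entries that were skipped because they contain a space.
    """
    skipped = []
    model.clearHotWords()
    for word, boost in boosts.items():
        if len(word.split()) != 1:
            # Entries such as "another one" had no effect in the tests above,
            # so they are reported back instead of being silently added.
            skipped.append(word)
            continue
        model.addHotWord(word, float(boost))
    return skipped


class RecordingModel:
    """Hypothetical stand-in that just records the calls it receives."""

    def __init__(self):
        self.hot_words = {}

    def clearHotWords(self):
        self.hot_words.clear()

    def addHotWord(self, word, boost):
        self.hot_words[word] = boost


model = RecordingModel()
skipped = apply_hot_words(model, {"okay": 5.0, "gold": -10.0, "another one": 5.0})
print(model.hot_words)  # single-word entries that were registered
print(skipped)          # multi-word entries that were left out
```

With a real `deepspeech.Model` in place of the stand-in, an external scorer has to be enabled before hot-words can be added. Keep in mind that a negative boost such as the one on “gold” may push the decoder towards a similar-sounding word instead.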

Any opinions would be appreciated.

We are really glad that we’ve made it this far, and it was an honor for us to support this great project in any way :slight_smile:


Great work, thanks guys and all the best.

I read in the other post that you have code and considerable experience with hot-word boosting. If you have a little bit of time, open a GitHub repo and write a simple README with your findings and links to the code. That way other people can find it more easily.

If you have a little bit more time, check the DS docs and add some documentation on how this feature works, with examples. More and more people are asking about it.

Again, thanks for your time, it is great to know more about how it works.

(@kreid, don’t know whether that’s within the scope, but this is good material :slight_smile: )

You can count on me, I’ll do it in my spare time :blush:

I’ve added the code for hot-word testing to a public repository on GitHub. I could not edit the original post, so I put it here:

Also, I’ll soon edit the DS docs based on those findings. Not sure how long it will take, as my exams are getting closer and closer :stuck_out_tongue:


Thanks a lot, we can link to it now if there are any questions about them.


I am new to this, so I apologize if the question is basic. Is what you did the same as “ok google” or “hey siri”?

I am looking for a wakeword / hotword for DeepSpeech.

I ran the code and it always shows me the same values:

TensorFlow: v2.3.0-6-g23ad988
DeepSpeech: v0.9.3-0-gf2e9c85
['cold', 'hot'] = (-100.0, -100.0) :: []
['cold', 'hot'] = (-100.0, 0.0) :: []
['cold', 'hot'] = (-100.0, 100.0) :: []
['cold', 'hot'] = (0.0, -100.0) :: []
['cold', 'hot'] = (0.0, 0.0) :: []
['cold', 'hot'] = (0.0, 100.0) :: []
['cold', 'hot'] = (100.0, -100.0) :: []
['cold', 'hot'] = (100.0, 0.0) :: []
['cold', 'hot'] = (100.0, 100.0) :: []

For wake words with the default English model, use words that are common in the English language for the best results. Proper nouns such as the name “Siri” may not be the best choice, but it is worth trying and finding out for yourself.

Did you try running the audio file you are using with the method described in the official documentation? I mean: ( The lack of a transcription may be caused by a wrong file format. DeepSpeech can only read PCM, mono, 16 kHz, 16-bit files, and it might fail on WAVE files that do not follow this specification exactly. I used Audacity to convert my ‘.wav’ files so that they meet this specification.
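If you want to check a file before feeding it to the model, something like the following might help. It is a minimal sketch that uses only Python’s standard `wave` module; the function name `wav_format_problems` is made up for this example, and it checks only the three properties mentioned above (mono, 16-bit, 16 kHz).

```python
import wave


def wav_format_problems(path):
    """Return a list of reasons why `path` is not mono, 16-bit, 16 kHz PCM."""
    problems = []
    try:
        with wave.open(path, "rb") as wav:
            if wav.getnchannels() != 1:
                problems.append(f"expected 1 channel, found {wav.getnchannels()}")
            if wav.getsampwidth() != 2:
                # getsampwidth() is in bytes, so 2 bytes means 16-bit samples
                problems.append(f"expected 16-bit samples, found {8 * wav.getsampwidth()}-bit")
            if wav.getframerate() != 16000:
                problems.append(f"expected 16000 Hz, found {wav.getframerate()} Hz")
    except wave.Error as exc:
        # The wave module raises this for compressed or malformed WAVE files
        problems.append(f"not a readable PCM WAVE file: {exc}")
    return problems
```

An empty list means the header looks fine. If the file passes this check and the output is still empty, the problem is probably somewhere else.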

I can’t understand. I have read the documentation but still can’t understand the use of hot_words. Does it add that word to the model/scorer, or is it not stored anywhere? If it is stored, is it permanent? For example, if I give a new word as a hot_word, will it be recognized faster next time? Please help, I am new and trying to understand it.