Hi,
We are students currently working on our Bachelor's degrees, and during our Software Testing classes our task was to test an open-source project. A few months ago I asked here what could use some testing. I was glad to receive a reply, and we started working. Doing our best, we analyzed the practical usage of the hot-word feature and the default DeepSpeech model's accuracy, so you can judge its practical quality in many scenarios.
Full report:
deepspeech_test_report.pdf (426.0 KB)
Short summary for 250 different audio files with tags (a sketch of the accuracy metric we refer to follows this list):
- Female, unaccented, lecture speech without any proper nouns or scientific words was the most accurate (95.6%).
- Male voices with the same tag combination were less accurate (94.0%); however, there were more male files in this combination, or the input files may simply have been a bit harder for the model to understand.
- An accent heavily lowers the accuracy (about 83.6% for lecture speech and 82.5% for common speech), whereas lecture speech from an unaccented speaker stays above 94.6%.
- Proper nouns reduce accuracy even further (about a 10% reduction in our data set for files containing at least one of them). Had we used more proper nouns, the difference could have been even larger, since it depends on the number of such words; still, this serves as evidence that the difference is real and should be taken into account.
- Female common speech with no additional tags has lower accuracy than male common speech (87.3% female, 91.4% male).
- Scientific vocabulary may be related to a drop in accuracy: male common speech with scientific words reached 86.4% accuracy, while the same tag combination without scientific words achieved 91.4%, a 5% drop.
As for the hot-words feature (a usage sketch follows this list):
- Adding a hot-word that contains a space, like "another one", doesn't change the behavior, probably because such a phrase never appears in the word-detection mechanism and therefore is never modified.
- Use hot-words if you need to detect a single word and can ignore everything that comes after it, because of the letter-splitting bug. Example: "okay google".
- You can give a word a negative boost so that it does not occur in the output, but be careful: the word may then show up split ("another" -> "an other") or as a similar-sounding word ("gold" -> "god").
- Using hot-words to calibrate accuracy is not ideal, but adding a very small boost can work.
- Nonexistent words used as hot-words, or hot-words that share no similarity with anything spoken, cause no change if the audio doesn't include the sound of that word.
- We detected no software errors caused by adding, removing, or clearing hot-words.
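For anyone who wants to try the behaviors above, here is a minimal sketch of the hot-word API as exposed by the `deepspeech` Python package (0.9.x); the model, scorer, and WAV file names are placeholders, and the boost values are just examples, not recommendations:

```python
import wave
import numpy as np
from deepspeech import Model

model = Model("deepspeech-0.9.3-models.pbmm")                 # placeholder path
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")  # placeholder path

# Positive boost: nudge the decoder towards "google".
model.addHotWord("google", 7.5)
# Negative boost: push "gold" out of the output (watch for "god" instead).
model.addHotWord("gold", -10.0)

# DeepSpeech expects 16 kHz, 16-bit mono PCM audio.
with wave.open("sample.wav", "rb") as w:                      # placeholder file
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(model.stt(audio))

model.eraseHotWord("gold")  # remove a single hot-word
model.clearHotWords()       # or clear all of them at once
```

In our runs, `addHotWord`, `eraseHotWord`, and `clearHotWords` never caused software errors; only the decoded output changed.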
Any opinions would be appreciated.
We are really glad that we've made it this far, and it was an honor for us to support this great project in any way we could.