Per word confidence

Hey guys! Does anyone know how to get the confidence/probability per word when you run inference on a sentence in DeepSpeech? For example, with the sentence:
“Hey how are you?”

Hey = 90% confident
how = 65% confident
…etc

This should be coming soon: https://github.com/mozilla/DeepSpeech/pull/2012

However, note that it is per-letter probability, not per word, and I’m not sure exactly how it will be exposed to clients from the API.

That PR exposes per-transcription probability, not per letter or per word. Doing either of those requires extending the decoder to keep track of the character/word level info.

Alright cheers thanks guys!

Per word timings would require we embed language info into the engine.

For example, in English you can, more or less, split on spaces to get words. However, for Mandarin written in Simplified Chinese, each character is a word, so code that splits on spaces would not split on words for Mandarin. If we split on words, the engine would need code that works differently for different languages.
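To make the splitting problem concrete, here is a minimal sketch in Python. The function name `split_on_spaces` and the sample strings are made up for illustration and are not part of DeepSpeech.

```python
def split_on_spaces(transcript):
    """Naive word segmentation: split on whitespace (an English-style assumption)."""
    return transcript.split()

english = "hey how are you"
mandarin = "你好吗"  # written without spaces between characters

print(split_on_spaces(english))   # ['hey', 'how', 'are', 'you']
print(split_on_spaces(mandarin))  # ['你好吗']  (one "word", which is not useful)

# A per-character split is a different, language-specific rule:
print(list(mandarin))             # ['你', '好', '吗']
```

The same function gives a sensible result for one language and a useless one for the other, which is the kind of language-specific branching described above.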

Embedding language specific info into the engine is not something we want to do. We want the engine to remain as language independent as possible.

Kelly, note this is about probabilities (approximate confidence values), not timings. We can compute per character probability (per timestep token in a generic way), but we don’t currently do it, and it would add significant overhead to the decoder’s state size.

Sorry, per word probabilities, not character. Currently we only store per-beam (per transcription candidate) probabilities.
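To illustrate the difference between a single per-candidate probability and per-timestep token probabilities, here is a rough sketch using greedy (best-path) CTC decoding over a toy output matrix. This is not how the DeepSpeech beam search decoder works; the alphabet, the `greedy_ctc_with_confidence` helper, and the numbers are all assumptions for illustration only.

```python
import math

# Hypothetical toy alphabet; index 0 is the CTC blank symbol.
ALPHABET = ["<blank>", "h", "i", " "]

def greedy_ctc_with_confidence(probs):
    """Greedy CTC decode that also keeps a confidence per emitted character.

    probs: list of per-timestep probability distributions over ALPHABET.
    Returns (text, per_char_confidences, path_log_prob).
    """
    chars, confidences = [], []
    path_log_prob = 0.0
    prev = None
    for dist in probs:
        best = max(range(len(dist)), key=dist.__getitem__)
        p = dist[best]
        path_log_prob += math.log(p)  # one number for the whole path
        if best != 0 and best != prev:
            # New character emitted: this is the extra state to track.
            chars.append(ALPHABET[best])
            confidences.append(p)
        elif best != 0:
            # Same character repeated across frames: keep the highest value.
            confidences[-1] = max(confidences[-1], p)
        prev = best
    return "".join(chars), confidences, path_log_prob

toy_probs = [
    [0.10, 0.80, 0.05, 0.05],  # "h"
    [0.20, 0.70, 0.05, 0.05],  # "h" again (collapsed)
    [0.90, 0.05, 0.03, 0.02],  # blank
    [0.10, 0.10, 0.60, 0.20],  # "i"
]
text, confidences, log_prob = greedy_ctc_with_confidence(toy_probs)
print(text, confidences)              # hi [0.8, 0.6]
print(round(math.exp(log_prob), 4))   # 0.3024
```

In a beam search, the extra `confidences` list would have to be carried for every candidate in the beam, which is where the additional decoder state comes from.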

Alright thank you. I really appreciate the replies 🙂

Sorry, don’t know why I ended up saying “timings” and not “probabilities”.

But you seem to suggest, assuming you’re trying to do this in a language independent manner, that it is possible to segment the output text into words. How would that work in a language independent manner?

I’m not suggesting that, just saying anything finer grained than per-candidate sentence probability requires extending the decoder.
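For anyone who does end up with per-character confidences from an extended decoder, one possible way to roll them up into per-word values on the client side is sketched below. It assumes a space-delimited language such as English, so it is exactly the kind of language-specific logic discussed above. The `word_confidences` helper and the numbers are hypothetical, and taking the minimum is just one choice of aggregate.

```python
def word_confidences(chars, confs, aggregate=min):
    """Group per-character confidences into per-word values by splitting on spaces.

    chars: sequence of decoded characters.
    confs: one confidence per character, same length as chars.
    aggregate: how to combine a word's character confidences (min by default).
    """
    words = []
    current_chars, current_confs = [], []
    for ch, p in zip(chars, confs):
        if ch == " ":
            if current_chars:
                words.append(("".join(current_chars), aggregate(current_confs)))
                current_chars, current_confs = [], []
        else:
            current_chars.append(ch)
            current_confs.append(p)
    if current_chars:  # flush the last word
        words.append(("".join(current_chars), aggregate(current_confs)))
    return words

chars = list("hey how")
confs = [0.90, 0.95, 0.92, 0.99, 0.70, 0.65, 0.80]
print(word_confidences(chars, confs))  # [('hey', 0.9), ('how', 0.65)]
```

Using the minimum makes a word only as confident as its weakest character; a mean or a product would be equally defensible, depending on what the confidence is used for.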