I want to make sure I understand the DeepSpeech source correctly. My objective is to modify the source in order to get word- and character-level confidence.
The candidate transcript responses only seem to return confidence at the transcript level, and this decode function seems to only aggregate scores rather than include word/character-level confidence. However, I am having a little trouble understanding the scope of a prefix: is a prefix meant to be synonymous with a candidate transcript, or is its scope supposed to be just a few words?
Also, the LM’s scorer function scores max_order words at a time, correct? So in order to get word-level confidence when using an LM, max_order would have to be set to 1, right?
max_order is a parameter used when training the LM. For most Western languages, where words are separated by whitespace, max_order is how many words of history are used when calculating the conditional probabilities P(Wk | Wk−1, Wk−2, …, Wk−max_order+1). E.g. a bi-gram model is an LM with a max_order of 2, a tri-gram model has a max_order of 3, and so on.
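To make that concrete, here is a toy sketch of what that conditional probability means, using plain maximum-likelihood counts with no smoothing (which is not how KenLM actually estimates it, but shows the role of max_order):

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Count every n-gram of length n in the token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def conditional_prob(tokens, word, context, max_order):
    # P(word | context) estimated as count(context + word) / count(context),
    # with the context truncated to the last max_order - 1 words.
    context = tuple(context[-(max_order - 1):]) if max_order > 1 else tuple()
    num = ngram_counts(tokens, len(context) + 1)[context + (word,)]
    den = ngram_counts(tokens, len(context))[context] if context else len(tokens)
    return num / den if den else 0.0

corpus = "the quick brown fox jumped over the lazy dog".split()
# Tri-gram (max_order = 3): P(fox | quick, brown)
print(conditional_prob(corpus, "fox", ["the", "quick", "brown"], max_order=3))
```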
The prefixes are a collection of all possible outputs at specific time steps. They are ranked by probability (summed over all possible paths that collapse to identical outputs) and pruned at the end of each round; only the best beam_size paths are kept.
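Very roughly, the pruning looks like this. This is a simplified sketch that skips the CTC-specific handling of blanks and of merging paths that collapse to the same output; `probs` is assumed to be a time_steps x alphabet matrix of character probabilities:

```python
import math

def simple_beam_search(probs, alphabet, beam_size):
    # Each prefix is a (text, accumulated log probability) pair; start empty.
    prefixes = [("", 0.0)]
    for timestep in probs:  # one row of per-character probabilities
        candidates = []
        for prefix, logp in prefixes:
            for char, p in zip(alphabet, timestep):
                if p > 0:
                    candidates.append((prefix + char, logp + math.log(p)))
        # Rank all extensions and keep only the best beam_size paths.
        candidates.sort(key=lambda c: c[1], reverse=True)
        prefixes = candidates[:beam_size]
    return prefixes

# e.g. two time steps over a tiny two-character alphabet
print(simple_beam_search([[0.6, 0.4], [0.1, 0.9]], ["a", "b"], beam_size=2))
```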
And a word-level confidence is already included in the prefix scoring when you’re using your model with a language model (the scorer); that’s what lm_alpha and lm_beta are for.
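Roughly speaking, whenever a prefix completes a word, the scorer adds an LM term to its score. A simplified sketch of that weighting (the function and names here are illustrative, not the actual C++ code):

```python
def prefix_score(acoustic_log_prob, lm_log_prob, word_count, lm_alpha, lm_beta):
    # CTC acoustic score, plus the language model log probability scaled by
    # lm_alpha, plus a per-word insertion bonus scaled by lm_beta.
    return acoustic_log_prob + lm_alpha * lm_log_prob + lm_beta * word_count
```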
It seems that in this function, lm_alpha/beta are used to weight the final prefix-level score, but do not produce word-level scores, correct?
My question, however, is about extracting the word-level score for each word in a candidate transcript. For example:
Transcript1: The quick brown fox jumped over the river.
Word-level confidence: [the, 99%], [quick, 89%], ... [jumped, 84%], ...
Transcript2: The quick brown fox slumped over the river.
Word-level confidence: [the, 99%], [quick, 89%], ... [slumped, 54%], ...
Yes, the language model gives scores at the prefix level. It scores throughout the entire decoding process, which makes the beam search results more accurate. The word confidence I mentioned above is the uni-gram probability; it’s not the same thing you’re asking for. When no higher-order gram is found, the model falls back to the probability of the uni-gram. Without smoothing in the language model, you can think of it as the term frequency of that word, for easy understanding.
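If it helps, you can see that fallback directly with the KenLM Python bindings: full_scores reports, for each word, the log10 probability and the length of the n-gram that was actually matched (a length of 1 means it backed off to the uni-gram). The model path below is just a placeholder:

```python
import kenlm

model = kenlm.Model("lm.binary")  # placeholder path to the LM used by your scorer
sentence = "the quick brown fox slumped over the river"

# full_scores yields (log10 probability, matched n-gram length, is_oov) per word,
# plus one final entry for the end-of-sentence token.
for (log10_prob, ngram_length, oov), word in zip(
        model.full_scores(sentence), sentence.split() + ["</s>"]):
    print(word, log10_prob, "matched order:", ngram_length, "oov:", oov)
```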
How about making the decode function return the K best results, and doing post-processing on those candidates to get the probability of each word?
I’m not sure whether modifying the decoder is the right track, but from my understanding, I wouldn’t do that.
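For what it’s worth, here is a minimal sketch of that post-processing idea, assuming you can get the K best transcripts together with their total log scores out of decode (the helper below is hypothetical, not part of the DeepSpeech API): normalise the candidate scores into a distribution, then take each word’s confidence to be the probability mass of the candidates that contain it at that position.

```python
import math
from collections import defaultdict

def word_confidences(candidates):
    """candidates: list of (transcript, total_log_prob) pairs,
    e.g. the K best outputs of the beam search decoder."""
    # Softmax over candidate scores to get a probability per transcript.
    max_lp = max(lp for _, lp in candidates)
    weights = [math.exp(lp - max_lp) for _, lp in candidates]
    total = sum(weights)

    # Confidence of (position, word) = probability mass of candidates containing it.
    confidence = defaultdict(float)
    for (transcript, _), w in zip(candidates, weights):
        for pos, word in enumerate(transcript.split()):
            confidence[(pos, word)] += w / total
    return dict(confidence)

best = [("the quick brown fox jumped over the river", -12.3),
        ("the quick brown fox slumped over the river", -14.1)]
print(word_confidences(best))
```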