I want to make sure I understand the DeepSpeech source correctly. My objective is to modify the source in order to get word- and character-level confidence.
The candidate transcript responses only seem to return confidence at the transcript level, and this decode function seems to only aggregate scores rather than include word/character-level confidence. However, I am having a little trouble understanding the scope of a prefix: is a prefix meant to be synonymous with a candidate transcript, or is its scope supposed to be just a few words?
Also, the LM’s scorer function scores max_order words at a time, correct? So in order to get word-level confidence when using an LM, max_order would have to be set to 1, right?
max_order is a parameter used when training the LM. For most Western languages, where words are separated by whitespace, max_order is how many words of history are used when calculating the conditional probabilities P(Wk | Wk−1, Wk−2, …, Wk−max_order+1). E.g. a bi-gram model is an LM with a max_order of 2, a tri-gram model has a max_order of 3, and so on.
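To make that concrete, here is a toy sketch of what that conditional probability means, using plain maximum-likelihood counts with no smoothing (which is not how KenLM actually estimates it, but shows the role of max_order):

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Count every n-gram of length n in the token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def conditional_prob(tokens, word, context, max_order):
    # P(word | context) estimated as count(context + word) / count(context),
    # with the context truncated to the last max_order - 1 words.
    context = tuple(context[-(max_order - 1):]) if max_order > 1 else tuple()
    num = ngram_counts(tokens, len(context) + 1)[context + (word,)]
    den = ngram_counts(tokens, len(context))[context] if context else len(tokens)
    return num / den if den else 0.0

corpus = "the quick brown fox jumped over the lazy dog".split()
# Tri-gram (max_order = 3): P(fox | quick, brown)
print(conditional_prob(corpus, "fox", ["the", "quick", "brown"], max_order=3))
```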
The prefixes are a collection of all possible outputs at specific time steps. They are ranked by probability (summed over all possible paths that collapse to identical outputs) and pruned at the end of each round; only the best beam_size paths are kept.
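Very roughly, the pruning looks like this. This is a simplified sketch that skips the CTC-specific handling of blanks and of merging paths that collapse to the same output; `probs` is assumed to be a time_steps x alphabet matrix of character probabilities:

```python
import math

def simple_beam_search(probs, alphabet, beam_size):
    # Each prefix is a (text, accumulated log probability) pair; start empty.
    prefixes = [("", 0.0)]
    for timestep in probs:  # one row of per-character probabilities
        candidates = []
        for prefix, logp in prefixes:
            for char, p in zip(alphabet, timestep):
                if p > 0:
                    candidates.append((prefix + char, logp + math.log(p)))
        # Rank all extensions and keep only the best beam_size paths.
        candidates.sort(key=lambda c: c[1], reverse=True)
        prefixes = candidates[:beam_size]
    return prefixes

# e.g. two time steps over a tiny two-character alphabet
print(simple_beam_search([[0.6, 0.4], [0.1, 0.9]], ["a", "b"], beam_size=2))
```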
And a word-level confidence is already included in the prefix scoring when you’re using your model with a language model (the scorer); that’s what lm_alpha and lm_beta are for.
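Roughly speaking, whenever a prefix completes a word, the scorer adds an LM term to its score. A simplified sketch of that weighting (the function and names here are illustrative, not the actual C++ code):

```python
def prefix_score(acoustic_log_prob, lm_log_prob, word_count, lm_alpha, lm_beta):
    # CTC acoustic score, plus the language model log probability scaled by
    # lm_alpha, plus a per-word insertion bonus scaled by lm_beta.
    return acoustic_log_prob + lm_alpha * lm_log_prob + lm_beta * word_count
```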
It seems that in this function, lm_alpha/beta are used to weight the final prefix-level score, but do not produce word-level scores, correct?
My question, however, is about extracting the word-level score for each word in a candidate transcript. For example:
Transcript1: The quick brown fox jumped over the river.
Word-level confidence: [the, 99%], [quick, 89%], ... [jumped, 84%], ...
Transcript2: The quick brown fox slumped over the river.
Word-level confidence: [the, 99%], [quick, 89%], ... [slumped, 54%], ...
Yes, the language model gives scores at the prefix level. It scores throughout the entire decoding process, which makes the beam search results more accurate. The word confidence I mentioned above is the uni-gram probability; it’s not the same thing you’re asking for. When no higher-order gram is found, the model falls back to the probability of the uni-gram. Without smoothing in the language model, you can think of it as the term frequency of that word, for easy understanding.
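If it helps, you can see that fallback directly with the KenLM Python bindings: full_scores reports, for each word, the log10 probability and the length of the n-gram that was actually matched (a length of 1 means it backed off to the uni-gram). The model path below is just a placeholder:

```python
import kenlm

model = kenlm.Model("lm.binary")  # placeholder path to the LM used by your scorer
sentence = "the quick brown fox slumped over the river"

# full_scores yields (log10 probability, matched n-gram length, is_oov) per word,
# plus one final entry for the end-of-sentence token.
for (log10_prob, ngram_length, oov), word in zip(
        model.full_scores(sentence), sentence.split() + ["</s>"]):
    print(word, log10_prob, "matched order:", ngram_length, "oov:", oov)
```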
How about making the decode function return the K best results, and doing post-processing on those candidates to get the probability of each word?
I’m not sure whether modifying the decoder is the right track, but from my understanding, I wouldn’t do that.
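For what it’s worth, here is a minimal sketch of that post-processing idea, assuming you can get the K best transcripts together with their total log scores out of decode (the helper below is hypothetical, not part of the DeepSpeech API): normalise the candidate scores into a distribution, then take each word’s confidence to be the probability mass of the candidates that contain it at that position.

```python
import math
from collections import defaultdict

def word_confidences(candidates):
    """candidates: list of (transcript, total_log_prob) pairs,
    e.g. the K best outputs of the beam search decoder."""
    # Softmax over candidate scores to get a probability per transcript.
    max_lp = max(lp for _, lp in candidates)
    weights = [math.exp(lp - max_lp) for _, lp in candidates]
    total = sum(weights)

    # Confidence of (position, word) = probability mass of candidates containing it.
    confidence = defaultdict(float)
    for (transcript, _), w in zip(candidates, weights):
        for pos, word in enumerate(transcript.split()):
            confidence[(pos, word)] += w / total
    return dict(confidence)

best = [("the quick brown fox jumped over the river", -12.3),
        ("the quick brown fox slumped over the river", -14.1)]
print(word_confidences(best))
```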