Quick heads up on some metadata / confidence estimate work we're doing

Hello all,

We thought we’d mention that over at the Te Hiku / Kōrero Māori project we’re doing some work on the decode_metadata() method to add a bit more info to the MetadataItem object, specifically per-letter confidence.

This work might feed into some projects around pronunciation. We might also use it to show a level of confidence in our transcription UI, to give some sense of when the model is confident of a transcription and when/where it might be worth a human reviewer looking a little closer.

We made a branch at around ‘deepspeech-0.5.0a8’ and we’re hoping that we’ll be able to turn it into a PR at some point in the future.

At this point the code is working for our experimentation, but the PR is not ready. We just thought it might be good to mention that we’re doing this in case it overlaps with other work already going on or coming soon.

It’s really only a few changes to the decode_metadata method in deepspeech.cc

557: ModelState::decode_metadata(const vector<float>& logits)

and to MetadataItem in deepspeech.h (where we’ve added three new properties for now)

// Stores each individual character, along with its timing and confidence information
struct MetadataItem {
  char* character;
  int timestep; // Position of the character in units of 20ms
  float start_time; // Position of the character in seconds
  double probability; // Logit value at the time the character was chosen
  double entropy; // Entropy across all logits at the time the character was chosen 
  char* acoustic_char; // Best guess from acoustic model at timestep of chosen letter (sometimes differs from best guess overall)
};
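
As a rough sketch of how a client might consume these fields (via the Python bindings, assuming probability and entropy get exposed there too; they’re part of this proposal, not the current API, and the thresholds below are made up purely for illustration):

def flag_uncertain(metadata, min_prob=0.5, max_entropy=0.7):
    # Collect characters the acoustic model was unsure about, so a human
    # reviewer knows where to look more closely. `metadata` is the object
    # returned by Model.sttWithMetadata().
    flagged = []
    for item in metadata.items:
        if item.probability < min_prob or item.entropy > max_entropy:
            flagged.append((item.character, item.start_time,
                            item.probability, item.entropy))
    return flagged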

Our current plan is to experiment with the above fields on some real-world data, so we can see which of these confidence measures is actually useful, maybe tweak things based on that feedback, and then create a PR. So it will be a wee ways off, and of course we fully expect we may have to adapt or minimize our changes even more based on feedback during the PR process.

That said, we figured we might as well get the word out there that this is something we’re working on.

In case there are replies, I figure I’ll tag my collaborators at TeHikuMedia, @kmahelona and @mathematiguy, and maybe @lissyx as well, since I believe you were the person who added the timing metadata in the first place.

PS Some example output… you’ll notice that at 8.20 seconds the acoustic model guesses ‘n’ but the language model corrects that in the final transcription to ŋ (aka ng).

Target transcription
# Ka whakapā a Hine ki tētahi āhuatanga whakahirahira o te whakamahinga o te reo Māori
Actual transcription (this one has 0% WER)
# ka whakapā a hine ki tētahi āhuatanga whakahirahira o te whakamahinga o te reo māori
Raw output transcription (in our new Te Reo Māori specific orthography)
# ka ƒakapā a hine ki tētahi āhuataŋa ƒakahirahira o te ƒakamahiŋa o te reo māori
Raw transcription if we only used the acoustic model 
# ka ƒakapā a hine ki tētahi āhuataŋa ƒakahirahira o te ƒakamahina o te reo māori

'char':seconds:'acoustic_char' prob entropy
'k':1.28:'k' prob:0.997742 entropy:0.025697
'a':1.48:'a' prob:0.999660 entropy:0.005033
' ':1.42:' ' prob:0.998576 entropy:0.015578
'ƒ':1.46:'ƒ' prob:0.999910 entropy:0.001490
'a':1.62:'a' prob:0.999978 entropy:0.000389
'k':1.60:'k' prob:0.999788 entropy:0.003169
'a':1.62:'a' prob:0.999978 entropy:0.000389
'p':1.84:'p' prob:0.991591 entropy:0.083503
'ā':1.86:'ā' prob:0.669923 entropy:0.946214
' ':2.22:' ' prob:0.898275 entropy:0.509989
'a':2.24:'a' prob:0.997645 entropy:0.026884
' ':2.46:' ' prob:0.636555 entropy:0.950531
'h':2.50:'h' prob:0.994121 entropy:0.061711
'i':2.52:'i' prob:0.998789 entropy:0.015438
'n':2.68:'n' prob:0.999938 entropy:0.000999
'e':2.70:'e' prob:0.998212 entropy:0.021415
' ':3.38:' ' prob:0.841081 entropy:0.633549
'k':3.08:'k' prob:0.965924 entropy:0.249148
'i':3.10:'i' prob:0.994305 entropy:0.060981
' ':3.38:' ' prob:0.841081 entropy:0.633549
't':3.66:'t' prob:0.999955 entropy:0.000785
'ē':3.44:'ē' prob:0.985221 entropy:0.119456
't':3.66:'t' prob:0.999955 entropy:0.000785
'a':3.68:'a' prob:0.999769 entropy:0.003429
'h':3.80:'h' prob:0.999525 entropy:0.005986
'i':3.82:'i' prob:0.999887 entropy:0.001707
' ':4.06:' ' prob:0.999827 entropy:0.002440
'ā':4.12:'ā' prob:0.999900 entropy:0.001572
'h':4.38:'h' prob:0.999580 entropy:0.005927
'u':4.40:'u' prob:0.999936 entropy:0.001080
'a':4.66:'a' prob:0.998985 entropy:0.012388
't':4.64:'t' prob:0.999953 entropy:0.000757
'a':4.82:'a' prob:0.999497 entropy:0.006564
'ŋ':4.80:'ŋ' prob:0.999840 entropy:0.002288
'a':4.82:'a' prob:0.999497 entropy:0.006564
' ':5.66:' ' prob:0.994251 entropy:0.053260
'ƒ':5.70:'ƒ' prob:0.994764 entropy:0.057157
'a':5.72:'a' prob:0.998290 entropy:0.020614
'k':5.86:'k' prob:0.995913 entropy:0.039462
'a':5.88:'a' prob:0.995949 entropy:0.040058
'h':6.08:'h' prob:0.996310 entropy:0.038118
'i':6.10:'i' prob:0.997903 entropy:0.025600
'r':6.20:'r' prob:0.999943 entropy:0.000914
'a':6.22:'a' prob:0.999786 entropy:0.003344
'h':6.36:'h' prob:0.998314 entropy:0.018087
'i':6.38:'i' prob:0.994112 entropy:0.059620
'r':6.52:'r' prob:0.999738 entropy:0.003533
'a':6.54:'a' prob:0.999723 entropy:0.003865
' ':6.86:' ' prob:0.999614 entropy:0.004946
'o':6.92:'o' prob:0.992238 entropy:0.078715
' ':7.06:' ' prob:0.999666 entropy:0.004346
't':7.08:'t' prob:0.999473 entropy:0.006716
'e':7.10:'e' prob:0.999567 entropy:0.005573
' ':7.30:' ' prob:0.999920 entropy:0.001255
'ƒ':7.34:'ƒ' prob:0.999175 entropy:0.010537
'a':7.36:'a' prob:0.999082 entropy:0.011338
'k':7.50:'k' prob:0.999428 entropy:0.007066
'a':7.80:'a' prob:0.999715 entropy:0.003959
'm':7.78:'m' prob:0.999868 entropy:0.002160
'a':7.80:'a' prob:0.999715 entropy:0.003959
'h':8.02:'h' prob:0.994506 entropy:0.051605
'i':8.04:'i' prob:0.997908 entropy:0.024319
'ŋ':8.20:'n' prob:0.163235 entropy:0.758166
'a':8.22:'a' prob:0.986215 entropy:0.106885
' ':8.84:' ' prob:0.999548 entropy:0.005679
'o':8.90:'o' prob:0.996885 entropy:0.037253
' ':9.08:' ' prob:0.999404 entropy:0.007245
't':9.10:'t' prob:0.999635 entropy:0.004874
'e':9.12:'e' prob:0.999580 entropy:0.005448
' ':9.20:' ' prob:0.999891 entropy:0.001612
'r':9.24:'r' prob:0.999954 entropy:0.000807
'e':9.26:'e' prob:0.999885 entropy:0.001855
'o':9.38:'o' prob:0.999943 entropy:0.000926
' ':9.60:' ' prob:0.999293 entropy:0.008496
'm':9.64:'m' prob:0.999928 entropy:0.001178
'ā':9.66:'ā' prob:0.999976 entropy:0.000439
'o':9.84:'o' prob:0.999944 entropy:0.000880
'r':9.94:'r' prob:0.998312 entropy:0.019489
'i':9.96:'i' prob:0.999996 entropy:0.000085

PPS @lissyx, looking at the above, it looks like there is (maybe, just maybe?) a bug in the way ctc_beam_search_decoder works?

Specifically, notice how the space characters before and after the word ‘ki’ appear to be the same repeated entry, so it looks like time goes backwards there for a bit (from 2.70 to 3.38, then back to 3.08)…?

Is that a bug? From what we can tell just looking at this it seems to be the timestep coming straight out of ctc_beam_search_decoder that does this.

If you think it might be a bug, I can try to dig further, see if I can find more examples like this, and maybe even create an issue. On the other hand, if it’s expected behaviour, I guess I won’t do that.

Thanks

'h':2.50:'h' prob:0.994121 entropy:0.061711
'i':2.52:'i' prob:0.998789 entropy:0.015438
'n':2.68:'n' prob:0.999938 entropy:0.000999
'e':2.70:'e' prob:0.998212 entropy:0.021415
' ':3.38:' ' prob:0.841081 entropy:0.633549 *
'k':3.08:'k' prob:0.965924 entropy:0.249148
'i':3.10:'i' prob:0.994305 entropy:0.060981
' ':3.38:' ' prob:0.841081 entropy:0.633549 *
't':3.66:'t' prob:0.999955 entropy:0.000785
'ē':3.44:'ē' prob:0.985221 entropy:0.119456 
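
If it helps, here’s a rough Python sketch of how we could scan decoder output for more of these; it just assumes a list of (character, start_time) pairs pulled from the metadata:

def find_time_reversals(items):
    # items: (character, start_time) pairs in decoder output order.
    # Flag every adjacent pair where time goes backwards, like the
    # ' ' -> 'k' step above (3.38 -> 3.08).
    return [(prev, cur)
            for prev, cur in zip(items, items[1:])
            if cur[1] < prev[1]]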

I created the initial implementation of the timing metadata. Letter confidence is definitely something I wanted to add but wasn’t sure how to extract it. This looks great - thanks for working on it.

For the client there’s definitely more value in word entropy than letter entropy. What do you think is the best way of calculating this - average entropy or peak entropy?

The entropy of each character is well defined: H = -Σ p log p, where the sum runs over all the characters in the alphabet.
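
For concreteness, that calculation is just the following (a minimal sketch in Python, where probs is the softmax output over the alphabet at a single timestep):

import math

def char_entropy(probs):
    # H = -sum(p * log p) over the whole alphabet at one timestep;
    # p = 0 terms contribute nothing. Natural log here; the base is
    # only a choice of units.
    return -sum(p * math.log(p) for p in probs if p > 0)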

The entropy of a word is trickier, as we would need to sum over the probabilities of all possible words. But one word-level measure that could be useful is the probability of the word itself, derived from the character probabilities. We can define this recursively: with W_0 = 0, the probability that the word up to letter i is incorrect is W_i = P_i W_{i-1} + (1 - P_i). Iterate this relation up to the space to get one minus the word probability. (Note that 1 - W_i = P_i (1 - W_{i-1}), so unrolling the recursion shows the word probability is just the product of its character probabilities.)
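
A minimal sketch of that iteration in Python, using the character probabilities for ‘hine’ from the output above:

def word_probability(char_probs):
    # W_i = P_i * W_{i-1} + (1 - P_i), starting from W_0 = 0, where W_i
    # is the probability the word is wrong up to letter i. Unrolled,
    # 1 - W_n is simply the product of the character probabilities.
    w = 0.0
    for p in char_probs:
        w = p * w + (1.0 - p)
    return 1.0 - w

# 'h', 'i', 'n', 'e' probabilities from the output above:
print(word_probability([0.994121, 0.998789, 0.999938, 0.998212]))  # ~0.991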


To make it clearer, here is a Gist calculating the probability of the word ‘hine’ from within Python, using the character probabilities above. This may be too simplistic; it would be good to check with some real-world transcriptions.

Below are some examples demonstrating (1) the transcribed text with the acoustic character probability, (2) the acoustic character with the entropy, and (3) the transcribed word with the calculated probability (from @tippy_top’s post above).

Correct pronunciation: audio


Incorrect pronunciation: audio



I’m not sure, I’ll check.

Yes, you can embed video.

Here is an incomplete answer. @leo will be able to provide more insight.

You need to paste a URL for a video on a separate line:

bla bla
https://example.com/video
yadda yadda

Example:

Hope this helps?

Best regards,
Henrik


Yes, as Henrik said, if you paste a URL by itself on a line, Discourse will do its best to embed that media, for example with one of your audio clips:

Up until today this wasn’t working properly because our CSP (Content Security Policy) was blocking external media sources, but I’ve fixed that now.


Mahalo @leo it seems to be working now. The only thing is our website, https://tehiku.nz, isn’t an “allowed” site for embedding iframes. I can’t expect Mozilla to change that just for us… but it would support “the little guys” as opposed to just allowing people to embed content from the big guys :sweat_smile:.

I’ve seen lots more examples where the timings are wrong. Any news on @utunga’s first question? Is it a bug or expected behaviour? @dabinat, maybe you know this better? Didn’t you say something about a shift you would apply to the timings as well? Thanks in advance!