I have a question about the timings as they relate to metadata, and would greatly appreciate any help or suggestions.
The actual problem I’m trying to solve is getting the level of confidence (or logit probability) for each letter chosen in the transcription. Since v0.5.x we’ve had a custom build of the binaries that simply grabs the probability (or log prob) from the output/logits layer of the network and attaches it to the Trie, so that it comes out as enhanced metadata alongside the timing information that was already there.
This has worked well for us in the past - high probability == high confidence in the sound - and we want to use this to provide pronunciation guidance to end-users.
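For context, here is roughly how we read that enhanced metadata on the Python side. This is a minimal sketch: the `prob` attribute is our custom patch to TokenMetadata and does not exist in the stock 0.7.1 bindings, and the model/audio file names are placeholders; everything else is the standard Python API.

```python
# Minimal sketch of how we consume the enhanced metadata from the Python bindings.
# NOTE: `t.prob` is our custom patch; it is not part of stock DeepSpeech 0.7.1.
import json
import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.7.1-models.pbmm")   # placeholder model path

with wave.open("utterance.wav", "rb") as w:  # placeholder audio path
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)

metadata = ds.sttWithMetadata(audio)
tokens = metadata.transcripts[0].tokens      # best candidate transcript

print(json.dumps([
    {"char": t.text, "start_time": round(t.start_time, 2), "prob": t.prob}
    for t in tokens
]))
```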
Unfortunately, when I apply those tweaks to the 0.7.1 build I get back data similar to the following. You’ll note that the probabilities are basically either close to 1 or close to 0.
[
{"char":"k", "start_time":1.04, "prob":0.999927},
{"char":"a", "start_time":1.06, "prob":0.975406},
{"char":" ", "start_time":1.08, "prob":0.0},
{"char":"a", "start_time":1.1, "prob":3e-06},
{"char":"r", "start_time":1.2, "prob":6e-06},
{"char":"o", "start_time":1.3, "prob":6e-06},
{"char":"h", "start_time":1.38, "prob":0.0},
{"char":"i", "start_time":1.48, "prob":0.0},
{"char":"a", "start_time":1.56, "prob":0.0},
{"char":" ", "start_time":1.66, "prob":0.0},
{"char":"k", "start_time":1.8, "prob":0.0},
{"char":"a", "start_time":1.84, "prob":2e-06},
{"char":"t", "start_time":1.88, "prob":0.0},
{"char":"o", "start_time":2.04, "prob":1e-06},
{"char":"a", "start_time":2.08, "prob":3.2e-05},
{"char":"t", "start_time":2.1, "prob":1.4e-05},
{"char":"i", "start_time":2.16, "prob":3e-06},
{"char":"a", "start_time":2.36, "prob":0.0},
{"char":" ", "start_time":2.44, "prob":0.0},
{"char":"t", "start_time":2.48, "prob":0.0},
{"char":"e", "start_time":2.6, "prob":1.1e-05},
{"char":" ", "start_time":2.66, "prob":0.0},
{"char":"h", "start_time":2.76, "prob":4e-06},
{"char":"ā", "start_time":2.78, "prob":7e-06},
{"char":"h", "start_time":2.8, "prob":0.998666},
{"char":"i", "start_time":3.04, "prob":6e-06},
{"char":" ", "start_time":3.08, "prob":0.0},
{"char":"m", "start_time":3.48, "prob":0.000301},
{"char":"e", "start_time":3.5, "prob":0.0},
{"char":" ", "start_time":3.54, "prob":0.0},
{"char":"ō", "start_time":3.62, "prob":0.0},
{"char":"n", "start_time":3.68, "prob":0.0},
{"char":"a", "start_time":3.88, "prob":0.99982},
{"char":" ", "start_time":3.9, "prob":2e-06},
{"char":"ƒ", "start_time":4.0, "prob":0.0},
{"char":"a", "start_time":4.06, "prob":0.999939},
{"char":"k", "start_time":4.08, "prob":0.0},
{"char":"a", "start_time":4.2, "prob":0.996441},
{"char":"p", "start_time":4.22, "prob":0.0},
{"char":"o", "start_time":4.38, "prob":1e-06},
{"char":"n", "start_time":4.42, "prob":0.0},
{"char":"o", "start_time":4.54, "prob":5e-06},
{"char":" ", "start_time":4.58, "prob":0.0},
{"char":"e", "start_time":4.76, "prob":4e-06},
{"char":" ", "start_time":4.78, "prob":0.999591},
{"char":"t", "start_time":4.94, "prob":0.000423},
{"char":"e", "start_time":4.96, "prob":3.4e-05},
{"char":" ", "start_time":5.0, "prob":0.0},
{"char":"h", "start_time":5.06, "prob":0.0},
{"char":"a", "start_time":5.1, "prob":0.999757},
{"char":"p", "start_time":5.14, "prob":0.0},
{"char":"ū", "start_time":5.3, "prob":0.000336},
{"char":" ", "start_time":5.34, "prob":1e-06},
{"char":"o", "start_time":5.66, "prob":1e-06},
{"char":" ", "start_time":5.68, "prob":0.834057},
{"char":"ō", "start_time":5.82, "prob":0.0},
{"char":"t", "start_time":5.88, "prob":0.0},
{"char":"ā", "start_time":6.12, "prob":0.000136},
{"char":"k", "start_time":6.2, "prob":0.0},
{"char":"o", "start_time":6.44, "prob":3e-06},
{"char":"u", "start_time":6.46, "prob":0.000128}]
In order to debug this, I looked at the raw logit values and compared them to the metadata timings.
I created the following image to explain the question (full-size version here):
Basically I created two versions of the timings/confidences and aligned them in time, so they could be compared to the actual audio source.
The chart in the middle is a visualization of the raw logits from the do_single_file_inference method in DeepSpeech.py. That is, it visualizes the output of the acoustic model before it goes through the CTC decoder.
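In case it’s useful, this is roughly how I render that middle chart. It’s a sketch under the assumption that I’ve dumped the softmaxed logits from do_single_file_inference to a file called `probs.npy` (that file name and the dump step are mine, not part of DeepSpeech.py):

```python
# Heatmap of the softmaxed logits: one row per alphabet symbol, one column per 20 ms frame.
# Assumes probs.npy holds the (time_steps, alphabet_size) array dumped from
# do_single_file_inference after applying softmax to the logits.
import numpy as np
import matplotlib.pyplot as plt

probs = np.load("probs.npy")           # shape: (time_steps, alphabet_size)
fps = 50                               # 16000 Hz sample rate / 320-sample step

fig, ax = plt.subplots(figsize=(16, 4))
ax.imshow(probs.T, aspect="auto", cmap="Reds", vmin=0.0, vmax=1.0)
ax.set_xlabel("frame (frame / %d = seconds)" % fps)
ax.set_ylabel("alphabet index (last row = CTC blank)")
plt.savefig("logits_heatmap.png", dpi=150)
```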
In both charts, a cell is dark red if its value is close to 1 and whitish if it is close to 0.
The chart at the top uses the same audio file as its source and is aligned in time with the one below [1], but it is built from the data returned by ds.sttWithMetadata() instead. (You’ll note that, because the CTC decode has taken place by this point, there are no more repeated characters.)
Nevertheless, as you can see, the timings don’t align. Furthermore, when I listen to the audio carefully and compare it with the timings, it seems that the acoustic model is getting the letter timings right, but after the CTC decode the timings no longer line up with the actual acoustic input.
Also, as you can see from the lack of red in that top chart, the ‘probabilities’ I’m pulling out of the metadata are close to zero in most cases.
It’s actually the second problem (the almost-zero probabilities) that I really want to solve, but I wonder if one causes the other. Not to mention that getting the timings exactly right would also be nice.
I am wondering if the CTC decode is somehow choosing the ‘right’ letter from the ‘wrong’ place in the logits stream. The decoder code is fairly complex and, not being an experienced C++ programmer, it’s been doing my head in, but I’d very much welcome any guidance or suggestions as to what might be going on here - and specifically any suggestions for how I can pull out the logit probabilities from the timesteps where, as you can see from the middle chart, the logit probabilities spike above 50%.
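One workaround I’ve been considering, purely in Python and without touching the C++ decoder: for each decoded token, search a small window of frames around its reported timestep and take the peak softmax probability for that character. This is only a sketch; `probs` is the (time_steps, alphabet_size) array from the middle chart, `alphabet` is an assumed list mapping alphabet indices to characters, and the window size is a guess.

```python
# Approximate per-character confidence from the acoustic-model probabilities,
# using the timestep reported in the standard TokenMetadata.
import numpy as np

def token_confidence(token, probs, alphabet, window=3):
    """Peak probability of token.text within +/- `window` frames of its timestep."""
    char_idx = alphabet.index(token.text)
    lo = max(0, token.timestep - window)
    hi = min(len(probs), token.timestep + window + 1)
    return float(probs[lo:hi, char_idx].max())

# Example usage (assuming `metadata`, `probs` and `alphabet` are already loaded):
# for t in metadata.transcripts[0].tokens:
#     print(t.text, t.start_time, token_confidence(t, probs, alphabet))
```

But that only works if the reported timesteps are at least roughly right, which is exactly what the charts make me doubt.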
I hope this question is clear? I’d be happy to add more details if it helps. Thanks!!
Miles
[1] To clarify how the charts are aligned in time: from my calculations, for our parameters (which I think are the defaults):
audio_window_samples = 512
audio_step_samples = 320
frames_per_second = 50
So, to align the charts I just multiply the frames per second (50) by the time (e.g. 1.04 seconds) to get the corresponding frame, which is to say the ‘x’ coordinate for the top chart.
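A worked example of that calculation, using the first token from the metadata above:

```python
# Frame index = start_time * frames_per_second.
sample_rate = 16000
audio_step_samples = 320
frames_per_second = sample_rate // audio_step_samples   # 16000 / 320 = 50

start_time = 1.04                                        # the first "k" above
frame = round(start_time * frames_per_second)            # = 52
print(frame)
```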