Questions about timings coming from metadata

I have a question about the timings as they relate to the metadata, and I would greatly appreciate any help or suggestions.

The actual problem I’m trying to solve is getting the level of confidence (or logit probability) for each letter chosen in the transcription. Since v0.5.x we’ve had a custom build of the binaries that grabs the probability (or log_prob) from the output/logits layer of the network and attaches it to the Trie, so that it comes out as enhanced metadata alongside the timing information that was already there.

This has worked well for us in the past - high probability == high confidence in the sound - and we want to use this to provide pronunciation guidance to end-users.

Unfortunately, when I apply those tweaks to the 0.7.1 build I get back data similar to the following. You’ll note that the probabilities are basically either close to 1 or close to 0.

[
{"char":"k", "start_time":1.04, "prob":0.999927},
{"char":"a", "start_time":1.06, "prob":0.975406},
{"char":" ", "start_time":1.08, "prob":0.0},
{"char":"a", "start_time":1.1,  "prob":3e-06},
{"char":"r", "start_time":1.2,  "prob":6e-06},
{"char":"o", "start_time":1.3,  "prob":6e-06},
{"char":"h", "start_time":1.38, "prob":0.0},
{"char":"i", "start_time":1.48, "prob":0.0},
{"char":"a", "start_time":1.56, "prob":0.0},
{"char":" ", "start_time":1.66, "prob":0.0},
{"char":"k", "start_time":1.8,  "prob":0.0},
{"char":"a", "start_time":1.84, "prob":2e-06},
{"char":"t", "start_time":1.88, "prob":0.0},
{"char":"o", "start_time":2.04, "prob":1e-06},
{"char":"a", "start_time":2.08, "prob":3.2e-05},
{"char":"t", "start_time":2.1,  "prob":1.4e-05},
{"char":"i", "start_time":2.16, "prob":3e-06},
{"char":"a", "start_time":2.36, "prob":0.0},
{"char":" ", "start_time":2.44, "prob":0.0},
{"char":"t", "start_time":2.48, "prob":0.0},
{"char":"e", "start_time":2.6,  "prob":1.1e-05},
{"char":" ", "start_time":2.66, "prob":0.0},
{"char":"h", "start_time":2.76, "prob":4e-06},
{"char":"ā", "start_time":2.78, "prob":7e-06},
{"char":"h", "start_time":2.8,  "prob":0.998666},
{"char":"i", "start_time":3.04, "prob":6e-06},
{"char":" ", "start_time":3.08, "prob":0.0},
{"char":"m", "start_time":3.48, "prob":0.000301},
{"char":"e", "start_time":3.5,  "prob":0.0},
{"char":" ", "start_time":3.54, "prob":0.0},
{"char":"ō", "start_time":3.62, "prob":0.0},
{"char":"n", "start_time":3.68, "prob":0.0},
{"char":"a", "start_time":3.88, "prob":0.99982},
{"char":" ", "start_time":3.9,  "prob":2e-06},
{"char":"ƒ", "start_time":4.0,  "prob":0.0},
{"char":"a", "start_time":4.06, "prob":0.999939},
{"char":"k", "start_time":4.08, "prob":0.0},
{"char":"a", "start_time":4.2,  "prob":0.996441},
{"char":"p", "start_time":4.22, "prob":0.0},
{"char":"o", "start_time":4.38, "prob":1e-06},
{"char":"n", "start_time":4.42, "prob":0.0},
{"char":"o", "start_time":4.54, "prob":5e-06},
{"char":" ", "start_time":4.58, "prob":0.0},
{"char":"e", "start_time":4.76, "prob":4e-06},
{"char":" ", "start_time":4.78, "prob":0.999591},
{"char":"t", "start_time":4.94, "prob":0.000423},
{"char":"e", "start_time":4.96, "prob":3.4e-05},
{"char":" ", "start_time":5.0,  "prob":0.0},
{"char":"h", "start_time":5.06, "prob":0.0},
{"char":"a", "start_time":5.1,  "prob":0.999757},
{"char":"p", "start_time":5.14, "prob":0.0},
{"char":"ū", "start_time":5.3,  "prob":0.000336},
{"char":" ", "start_time":5.34, "prob":1e-06},
{"char":"o", "start_time":5.66, "prob":1e-06},
{"char":" ", "start_time":5.68, "prob":0.834057},
{"char":"ō", "start_time":5.82, "prob":0.0},
{"char":"t", "start_time":5.88, "prob":0.0},
{"char":"ā", "start_time":6.12, "prob":0.000136},
{"char":"k", "start_time":6.2,  "prob":0.0},
{"char":"o", "start_time":6.44, "prob":3e-06},
{"char":"u", "start_time":6.46, "prob":0.000128}]

To debug this, I looked at the raw logit values and compared them to the metadata timings.

I created the following image to explain the question (full-size version here):

Basically I created two versions of the timings/confidences and aligned them in time, so they could be compared to the actual audio source.

The chart in the middle is just a visualization of the raw logits from DeepSpeech.py’s ‘infer’ path (the do_single_file_inference method). That is, it visualizes the acoustic model’s output before it goes through the ctc_decoder.

In both charts, if the number in a particular cell is close to 1 the cell is dark red; if it is close to 0 it is whitish.
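
For reference, the middle chart is generated roughly like this (a simplified sketch; dumping the logits to a .npy file from do_single_file_inference is my own local modification, not something stock DeepSpeech does):

import numpy as np
import matplotlib.pyplot as plt

# The dump step is a local tweak: inside do_single_file_inference I save the
# raw logits with np.save("logits_dump.npy", logits) before decoding.
logits = np.load("logits_dump.npy")  # shape: (time_steps, num_classes)

# Numerically stable softmax so each timestep row sums to 1.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Heatmap: x = timestep (frame), y = character class; dark red = close to 1.
plt.imshow(probs.T, aspect="auto", origin="lower", cmap="Reds", vmin=0.0, vmax=1.0)
plt.xlabel("timestep (frame, 50 per second)")
plt.ylabel("character class index")
plt.colorbar(label="probability")
plt.show()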

The chart at the top uses the same audio file as its source and is aligned in time with the one below [1], but it uses the data from ds.sttWithMetadata() instead. (You’ll note that, because the CTC decode has taken place by this point, there are no more repeated characters.)
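
For completeness, this is roughly how the data behind the top chart is gathered (a simplified sketch with placeholder file names; the stock 0.7.1 Python package only exposes text, timestep and start_time per token, and the prob values shown above come from our custom build):

import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.7.1-models.pbmm")
ds.enableExternalScorer("deepspeech-0.7.1-models.scorer")

with wave.open("audio.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

# Ask for a single candidate transcript and walk its per-character tokens.
metadata = ds.sttWithMetadata(audio, 1)
for token in metadata.transcripts[0].tokens:
    # Stock 0.7.x exposes text, timestep and start_time per token; the "prob"
    # field shown in the JSON above only exists in our custom build.
    print(token.text, token.timestep, token.start_time)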

Nevertheless, as you can see, the timings don’t align. Furthermore, when listening carefully to the audio ‘manually’ and comparing it with the timings, it seems that the acoustic model is getting the letter timings right, but after the CTC_decode() method the timings no longer line up with the actual acoustic input.

Also, as you can see from the lack of red in the top chart, the ‘probabilities’ I’m pulling out of the metadata are also close to zero in most cases.

It’s actually the second problem (the near-zero probabilities) that I really want to solve, but I wonder whether one causes the other. Not to mention that getting the timings exactly right would also be nice.

I am wondering if the CTC decode method is somehow choosing the ‘right’ letter from the ‘wrong’ place in the logits stream. The CTC_Decode code is quite complex and, not being an experienced C++ programmer, it’s been doing my head in, but I’d certainly very much welcome any guidance or suggestions as to what might be going on here, and specifically any suggestions for how I can pull out the logit probabilities from the timesteps where, as you can see from the middle chart, the logit probabilities spike above 50%.
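
To make that last question concrete, this is roughly what I’d like to be able to do (just a sketch; probs is the softmax matrix behind the middle chart and alphabet is a character-to-class-index mapping from my own setup):

import numpy as np

def peak_prob(probs, alphabet, char, timestep, window=3):
    # probs: (time_steps, num_classes) softmax matrix from the middle chart.
    # alphabet: dict mapping each character to its class (column) index.
    lo = max(0, timestep - window)
    hi = min(len(probs), timestep + window + 1)
    column = alphabet[char]
    local = probs[lo:hi, column]
    best = int(np.argmax(local))
    return float(local[best]), lo + best  # (peak probability, frame it occurs at)

# Example: the "k" token reported at 1.04 s sits at timestep 52 (1.04 * 50),
# so something like peak_prob(probs, alphabet, "k", 52) should recover the spike.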

I hope this question is clear; I’d be happy to add more details if that helps. Thanks!

Miles

[1] To clarify how the charts are aligned in time: from my calculations, our parameters (which I think are the defaults) are:

audio_window_samples = 512
audio_step_samples = 320
frames_per_second = 50

So, to align the charts, I just multiply the frames per second (50) by the time (e.g. 1.04 seconds) to get the corresponding frame, which is to say the ‘x’ coordinate for the top chart.
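
In code that’s simply the following (assuming the default 16000 Hz sample rate, which gives the 50 frames per second above):

sample_rate = 16000  # assumed default sample rate
audio_step_samples = 320
frames_per_second = sample_rate // audio_step_samples  # = 50

def time_to_frame(start_time_seconds):
    # Map a metadata start_time to a column (frame) in the logits chart.
    return round(start_time_seconds * frames_per_second)

# e.g. time_to_frame(1.04) == 52, the x coordinate used for the 'k' token above.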

One change related to timings that landed in 0.7.0 was this: https://github.com/mozilla/DeepSpeech/commit/33760a6bcd5b86285418507c95caa6840513c1c3

And some context here: https://github.com/mozilla/DeepSpeech/issues/2867

The heuristic that was removed used to push the metadata timesteps forward if the probability was higher. An issue was filed recently talking about this: https://github.com/mozilla/DeepSpeech/issues/3180

I think you’re seeing the same thing, because in your case the timings are also early, as described by the reporter in issue 3180. As mentioned in the discussion there, it’s hard to reason about these heuristics and how they affect the transcription process globally.

There are two quick things you can try. The first is enabling cutoff probability pruning, for example by setting --cutoff_prob 0.99. See how that changes the timings; I suspect it’ll improve them for you. On the other hand, it may degrade accuracy too much, so you should run a test report and see the effect there.
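
For context, cutoff probability pruning restricts which characters the beam search may extend with at each timestep: candidates are sorted by probability and only the smallest set whose cumulative probability reaches the cutoff (capped at cutoff_top_n) is kept. Roughly, and only as an illustration of the idea rather than the actual decoder code:

import numpy as np

def prune_candidates(frame_probs, cutoff_prob=0.99, cutoff_top_n=40):
    # Sort character classes by probability (descending) and keep the smallest
    # set whose cumulative probability reaches cutoff_prob, capped at cutoff_top_n.
    order = np.argsort(frame_probs)[::-1]
    cumulative = np.cumsum(frame_probs[order])
    keep = int(np.searchsorted(cumulative, cutoff_prob)) + 1
    keep = min(keep, cutoff_top_n)
    return order[:keep]  # class indices the beam search is allowed to extend with

# Example: on a frame where one character dominates, only the top few classes survive.
frame = np.array([0.90, 0.06, 0.02, 0.01, 0.01])
print(prune_candidates(frame, cutoff_prob=0.99))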

The second is to try reintroducing the heuristic deleted in that commit and seeing whether it improves things for you. If it does, and it seems to help the reporter of issue 3180 as well, maybe we should just reintroduce it until we figure out a cleaner solution.
