Hey where'd those logits go?

Hi all, and thanks again for this awesome project.

With the recent changes in the 0.5.1 release (and above) I'm having trouble getting access to what used to be the logits vector<float> in the decode_metadata method.

Once upon a time we made some development-only changes to (our version of) the 0.4.x branch to add confidence estimates at the character level. This was discussed a bit here.

This worked great for what we were doing then, but now we want to do it again, for real this time™. Unfortunately, for various reasons, we had to wait a while before picking up the project of pushing this into production.

The problem is, we don't want to make our production changes in an out-of-date branch, and in the meantime DeepSpeech has moved on (from 0.4.x to 0.5.1, and now 0.6.x), so merging our changes with changes from upstream is proving quite problematic. I'd appreciate any advice on the best way forward from here.

To be more specific…

Once upon a time, decode_metadata was handed a simple vector of logits, like this:

decode_metadata(const vector<float>& logits)

This was super handy because I could just save the probabilities (of the best guess) out directly into the MetaDataItem, like this (after making the relevant changes to MetaDataItem):

items[i].probability =
   logits[best.timesteps[i] * num_classes + best.tokens[i]];

We also had some entropy-like calculations based on the full vector of logits for each time step: low entropy was taken as high confidence, while high entropy meant more uncertainty in the acoustic model.
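Roughly, the idea was something like this (a minimal sketch, not our exact code; it assumes the vector holds one normalized distribution per timestep, laid out timestep-major, so you'd apply a softmax first if the values are raw logits):

#include <cmath>
#include <vector>

// Sketch: Shannon entropy of the class distribution at one timestep.
// Assumes `probs` is laid out [timestep * num_classes + class] and is
// already normalized (softmax first if these are raw logits).
float timestep_entropy(const std::vector<float>& probs,
                       int timestep, int num_classes) {
  float entropy = 0.0f;
  for (int c = 0; c < num_classes; ++c) {
    float p = probs[timestep * num_classes + c];
    if (p > 0.0f) {
      entropy -= p * std::log(p);  // a peaked distribution gives low entropy
    }
  }
  return entropy;  // low entropy ~ high acoustic confidence
}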

But the problem is, in the 0.5.1 release the signature changed to:

decode_metadata(DecoderState* state)

And on the master branch (aka 0.6.x) it's just:

decode_metadata()

So, my problem is: what's the best way to get back to the logits? Do I need to change decode_raw and save them out into the DecoderState, or is there some way to save these values without getting down into the ctc_decoder level?

Or, at the very least, preserve just the confidence of the alphabet character picked at each step of the ctc_decoder output?

I guess I'll try to muddle through, but any suggestions on how best to proceed here would be much appreciated!

Thanks in advance…

Thought I'd tag my collaborators @keoni and @mathematiguy into this thread, and @dabinat for good measure; hope that's OK ;-)

PPS I'm thinking of using PathTrie->log_prob_c as a rough proxy for the logit value per character. This would at least give me something vaguely resembling the acoustic confidence for the individual character.

Would that work? There are quite a few different probability-like fields defined in path_trie.cpp; are these documented somewhere? I'm not trying to understand every detail of the beam search, but it would at least be interesting to know the difference between log_prob_b_cur, log_prob_c and log_prob_nb_cur.

It has actually moved yet again since the commit you linked to… :grimacing: Sorry for the churn; we hope it's for the best. The implementation should be easier to understand and modify now than it was in v0.4.1, plus concurrent streams work!

The crux of the difficulty for you is that since v0.4.1 we implemented a streaming decoder. This means the decoding process is done iteratively after each pass of 16 timesteps (320ms) through the acoustic model, and we no longer accumulate the full logits vector for the entire audio file. DS_IntermediateDecode now isn’t super expensive to call, and applications don’t have to worry as much about finely tuning their voice activity detection to get good interactive performance.
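Concretely, the streaming flow looks roughly like this (a simplified sketch with illustrative stand-in names, not the actual API):

#include <vector>

// Illustrative stand-in for the real decoder state, just so the
// sketch is self-contained.
struct DecoderState {
  void next(const float* probs, int time_dim, int class_dim) {
    // the beam search consumes this chunk and updates its trie here
  }
};

// Each chunk of acoustic output (~16 timesteps) is fed to the decoder
// and then dropped; the full logits vector for the whole file never
// exists in one place.
void feed_chunks(DecoderState& state,
                 const std::vector<std::vector<float>>& chunks,
                 int num_classes) {
  for (const std::vector<float>& probs : chunks) {
    int time_dim = static_cast<int>(probs.size()) / num_classes;
    state.next(probs.data(), time_dim, num_classes);
    // nothing downstream retains `probs` for a later
    // decode_metadata call
  }
}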

Right now here’s what happens. When you call a metadata method in the API, StreamingState::finishStreamWithMetadata gets called: https://github.com/mozilla/DeepSpeech/blob/cd2e8c8947c0e0469344a01b5f1322294238ac02/native_client/deepspeech.cc#L146-L151

This then calls ModelState::decode_metadata(DecoderState& state): https://github.com/mozilla/DeepSpeech/blob/cd2e8c8947c0e0469344a01b5f1322294238ac02/native_client/modelstate.cc#L45-L69

Which calls the actual decoder implementation in DecoderState::decode: https://github.com/mozilla/DeepSpeech/blob/cd2e8c8947c0e0469344a01b5f1322294238ac02/native_client/ctcdecode/ctc_beam_search_decoder.cpp#L155-L204

Unfortunately for you, none of these functions have access to the logits vector, because it was already processed and discarded as the audio was being fed, before the call to DS_FinishStreamWithMetadata.

But fortunately for you, the information you want has always been kept in the PathTrie structure used by the decoder, although in slightly modified form. The log_prob_c member of the PathTrie structure is log(value you want) (computed in get_pruned_log_probs). So in DecoderState::decode, where the Output structure is created, you should be able to output those values as well, by changing PathTrie::get_path_vec, which currently saves character and timestep vectors, to also save a character (log-)probability vector. This should be a start: https://gist.github.com/reuben/ee38a3608c3542e6515c6b9afdb5d670 (completely untested, I haven't even checked that it compiles).
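In outline, the change might look something like this (a rough sketch of the idea only, not the contents of the gist; the extra log_probs argument is hypothetical):

#include <vector>

// Sketch only: a PathTrie node stripped down to the fields relevant
// here; the real structure lives in native_client/ctcdecode/path_trie.h.
struct PathTrie {
  int character;
  int timestep;
  float log_prob_c;  // set from get_pruned_log_probs during decoding
  PathTrie* parent;

  // Hypothetical extended get_path_vec: recurse up to the root, then
  // emit characters, timesteps, and per-character log probs in path order.
  PathTrie* get_path_vec(std::vector<int>& output,
                         std::vector<int>& timesteps,
                         std::vector<float>& log_probs) {
    if (parent == nullptr) {
      // root node: reset the accumulators and bottom out the recursion
      output.clear();
      timesteps.clear();
      log_probs.clear();
      return this;
    }
    PathTrie* root = parent->get_path_vec(output, timesteps, log_probs);
    output.push_back(character);
    timesteps.push_back(timestep);
    log_probs.push_back(log_prob_c);  // the per-character confidence
    return root;
  }
};

DecoderState::decode would then copy the extra vector into the Output structure alongside the tokens and timesteps it already fills in.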


Awesome! Thanks for the response. Looking through that untested gist, it looks vaguely like what I've started sketching out here.

I also haven't tested this yet, but if this seems like the right direction (i.e. using log_prob_c), I may continue down this path.

Of course, the thing that worries me is that this ends up changing the core inference path (i.e. code that gets executed when doing normal inference, not just inference with MetaData enabled). Is there an approach you'd recommend, such as setting a flag, so as not to do this type of thing during normal inference? Or does it not really matter that much?

I’m also very intrigued by that

const int top_paths = 1;

parameter… because, for pronunciation-feedback purposes etc., it would be great to have a couple of alternative Output options to present to the end user, as in: we know you were trying to say this, but it sounded a bit like you said this :wink:

It's late here so I'm going to leave this for now, but thanks heaps for the reply; I'll take a very close look at the gist you posted tomorrow. Thanks!

Exposing this information shouldn't change any of the core inference paths, as it only changes what gets extracted out of the trie structure at decode time, so it should be safe to do without any flags. In fact, that's probably a good test for whether something is wrong with the patch: if it changes the output, it's doing too much!

As for top_paths: if you change it to a bigger value you'll get a vector with more than just the highest-confidence Output from the decoder. We have a long-standing issue for exposing this parameter in the API, here: https://github.com/mozilla/DeepSpeech/issues/432 It requires some thought on how to expose multiple transcriptions at the API level; maybe leveraging our Metadata structure is the easiest way.
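As a sketch of what consuming multiple results might look like (hypothetical names throughout; it just assumes the decoder hands back its Output structs sorted best-first):

#include <cstddef>
#include <vector>

// Illustrative stand-in for the decoder's Output struct.
struct Output {
  std::vector<int> tokens;     // alphabet indices for one candidate path
  std::vector<int> timesteps;  // timestep at which each token was emitted
};

// With top_paths raised to e.g. 3, decode would return several
// candidates, best first, instead of just one.
void present_alternatives(const std::vector<Output>& results) {
  for (std::size_t i = 0; i < results.size(); ++i) {
    const Output& candidate = results[i];
    // i == 0: the best guess ("we know you were trying to say this");
    // i > 0: the near misses ("it sounded a bit like you said this").
    (void)candidate;  // real code would map tokens through the alphabet
  }
}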


Brilliant, thanks. And yeah, since this is going 'into production' at some point, we should be able to run a reasonable corpus of test transcriptions through and check that it gives the same result. At some point we also hope to make a proper PR. Thanks again.