Time to decode increasing over time using the streaming feature

Hi, I’m calling IntermediateDecode once per second (I’m still testing). In the first few seconds decoding is fast: see the red lines in the first image, which are relatively short compared to those in the second.
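
For reference, this is roughly the call pattern I’m using (a minimal sketch assuming the 0.7-era Python bindings; the model path and the audio source below are placeholders, not my real setup):

```python
# Minimal sketch of the call pattern (DeepSpeech 0.7-era Python bindings
# assumed; model path and audio source are placeholders, not my real setup).
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-model.pbmm")     # placeholder model path
stream = ds.createStream()

SAMPLE_RATE = 16000                     # model sample rate (16 kHz)

def next_second_of_audio():
    """Placeholder audio source: one second of silence as 16-bit PCM."""
    return np.zeros(SAMPLE_RATE, dtype=np.int16)

for second in range(1, 31):
    stream.feedAudioContent(next_second_of_audio())
    partial = stream.intermediateDecode()    # called once per second
    print(f"{second:2d}s -> {partial!r}")

print("final:", stream.finishStream())
```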

Here’s a video using the streaming feature that shows the decoding time increasing. (This is a very old AMD CPU without AVX and AVX2, so it runs slower than real time.)

With just 10 seconds of streaming there is already a noticeable difference in the decode time.

My understanding is that all of the accumulated logits are used at every decode; is that the cause of the slow decoding? @reuben, let me know if I’m wrong.

I don’t see it. Can you provide a better / more accurate view and plots? There is no scale, no focus, and no description of where to look. How are we supposed to read and interpret those?

There are no more plots :/, just peaks of CPU usage over a timeline. For example, in the first seconds a decode takes 600 ms, but once we have processed about 15 seconds the decode time goes up to 1900 ms. I think the issue is that all of the accumulated logits are being evaluated every time, even the very old ones.
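
To get numbers that can actually be plotted with labeled axes, something like this could log the per-call decode time (same hedged sketch as above, assuming the Python bindings; the model path and silent audio are placeholders):

```python
# Sketch: log seconds of audio fed vs. intermediateDecode() latency to a CSV,
# so the growth can be plotted with proper axes. Same assumptions as above
# (0.7-era Python bindings, placeholder model path and silent audio).
import csv
import time
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-model.pbmm")
stream = ds.createStream()
SAMPLE_RATE = 16000

with open("decode_times.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["audio_seconds_fed", "intermediate_decode_ms"])
    for second in range(1, 31):
        stream.feedAudioContent(np.zeros(SAMPLE_RATE, dtype=np.int16))
        t0 = time.perf_counter()
        stream.intermediateDecode()
        writer.writerow([second, round((time.perf_counter() - t0) * 1000, 1)])

stream.finishStream()
```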

That’s already more information. Your previous plots are totally unusable: there is no axis documentation, no scale, and one does not know what to read on them.

We explicitly accumulate logits, yes, so I’d suspect this is what we want?

But why all of them? Why not just the logits within the n-context window? Am I wrong?
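
To illustrate what I mean (a toy only, not DeepSpeech internals): if every intermediate decode runs over all accumulated frames, the per-call cost grows with the length of the stream, whereas a hypothetical fixed window would keep it roughly constant:

```python
# Toy illustration only, not DeepSpeech internals: compare "decode" cost over
# all accumulated frames vs. over a hypothetical fixed window. The frame rate,
# alphabet size and window length are made-up numbers.
import time
import numpy as np

FRAMES_PER_SECOND = 50                  # assumed acoustic-model output rate
NUM_CLASSES = 29                        # assumed alphabet size
WINDOW = 5 * FRAMES_PER_SECOND          # hypothetical fixed decode window

def fake_decode(logits):
    """Stand-in whose cost scales with the number of frames it is given."""
    return np.argmax(logits, axis=1).sum()

accumulated = np.empty((0, NUM_CLASSES), dtype=np.float32)
for second in range(1, 31):
    new_frames = np.random.rand(FRAMES_PER_SECOND, NUM_CLASSES).astype(np.float32)
    accumulated = np.vstack([accumulated, new_frames])

    t0 = time.perf_counter()
    fake_decode(accumulated)            # over everything fed so far
    full_ms = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    fake_decode(accumulated[-WINDOW:])  # over the last window only
    window_ms = (time.perf_counter() - t0) * 1000

    print(f"{second:2d}s  full={full_ms:7.3f} ms  window={window_ms:7.3f} ms")
```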