undefined

points

[-]

I don't think its mathematically equivalent or even close because the context/logprobs will be very different, since you only produce 1 token per pass. I'd say the token itself has a lot less information than the signal propagating through the residual stream of transformer blocks.

by dnhkng4 hours ago|

prev|

[-]

Maybe, but the interesting thing for me it this only works with specific 'chunks' of the transformer layer stack. More or less that the optimal leads to worse performance.