Yeah, when I read a model’s chains of thought I tend to want to interrupt it because it’s going in the wrong direction. But usually the end result is still fine.
It's similar to the process that transformers use when you ask them to do arithmetic without tools, I think. Some CoT tokens must be emitted up front for use as a computational substrate, but exactly what tokens they are isn't necessarily important or relevant to the final answer. And when that answer is returned, it may not be possible to tell what the actual reasoning process looked like behind the scenes.

It only makes sense that the same mechanism comes into play in strictly-verbal contexts.

Also, this is why "distillation attacks" are largely bullshit that Anthropic spreads for political purposes. Proper distillation requires access to the logits.

> Proper distillation requires access to the logits

Why do you need logits? Can't you just train with cross-entropy loss against the hard labels (the tokens the model actually emitted), like you do in regular pretraining?
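To make the two options concrete, here is a minimal stdlib-only sketch (function names are hypothetical, not from any library). The hard-label loss needs only the token IDs the teacher emitted, plus the *student's* own logits, which you always have. The soft-label loss (KL divergence between teacher and student distributions, as in classic logit distillation) needs the teacher's full logits at every position. The temperature parameter and the omission of the common T² scaling are simplifying assumptions.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one position's vocabulary logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def hard_label_loss(student_logits, teacher_token_ids):
    """Mean cross-entropy of the student against the teacher's emitted tokens.

    Only the teacher's sampled token IDs are required (e.g. plain text from an API);
    the logits used here are the student's own.
    """
    total = 0.0
    for logits, tok in zip(student_logits, teacher_token_ids):
        total -= log_softmax(logits)[tok]
    return total / len(teacher_token_ids)

def soft_label_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean KL(teacher || student) per position; requires the teacher's full logits.

    Assumption: the T**2 gradient scaling often used in practice is omitted.
    """
    total = 0.0
    for s, t in zip(student_logits, teacher_logits):
        log_p_t = log_softmax([x / temperature for x in t])
        log_p_s = log_softmax([x / temperature for x in s])
        total += sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    return total / len(student_logits)

# Toy example: 2 positions, vocabulary of 3 tokens.
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
teacher_tokens = [0, 2]             # all you get from a text-only API
teacher_logits = [[0.0, 0.0, 0.0],  # only available with logit access
                  [0.0, 0.0, 0.0]]

print(hard_label_loss(student, teacher_tokens))
print(soft_label_loss(student, teacher_logits))
```

The hard-label version is ordinary next-token training on teacher-generated text; the soft-label version additionally transfers how the teacher spread probability over tokens it did not pick.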

There are definitely current-gen open-weight models (Step 3.7 Flash is one) that refer to themselves as an OpenAI model in CoT, but not in the final response.

How do I get that loss, though, without the softmax inputs?