You are correct in my intentions on this post generally.
I want to highlight:
I want to measure performance of the LLMs over time- which includes assessing the quality of their outputs. I don’t perceive the reasoning output to be anything other than a measurable signal of possible drift in model performance.
Except it isn’t, because I’m only getting a low value summary of the thinking.
It’s like asking your buddy how fast he thought that last pitch was when radar guns are behind the plate.
Yeah, it’s a description related to what happened, but it’s not the thing I want to measure.
It only makes sense that the same mechanism comes into play in strictly-verbal contexts.
Also, this is why "distillation attacks" are largely bullshit that Anthropic spreads for political purposes. Proper distillation requires access to the logits.
Why do you need logits? Can't you just train on cross-entropy loss of the model against the hard decision, like you do in regular pretraining?
There are definitely current-gen open-weight models (Step 3.7 Flash is one) that refer to themselves as an OpenAI model in CoT, but not in the final response.