Yeah, it looks like a model issue to me. If the harness had a (semi-)deterministic bug and the model weren't robust to such mix-ups, we'd see this behavior much more frequently. It looks like the model just starts getting confused depending on what's in the context; speakers are just tokens, after all, handled in the same probabilistic way as all other tokens.
reply
The autoregressive engine should detect when the model starts emitting tokens under the user section of the prompt. In fact, it should have stopped before that and waited for new input. If a harness passes assistant output back into the conversation prompt as a user message, it's not surprising that the model gets confused. But that would be a harness bug, or, if there's no way around it, a limitation of modern prompt formats, which only account for one assistant and one user per conversation. Still, it's very bad practice to put anything in a user message that didn't actually come from the user. I've seen this in many apps across companies, and it always causes these problems.
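To make the failure mode concrete, here's a minimal sketch of how a harness typically serializes a conversation. The ChatML-style markers (`<|im_start|>`, `<|im_end|>`) are one common convention used purely for illustration; the exact tokens vary by model.

```python
# Sketch: how a harness might serialize a conversation into a prompt string.
# ChatML-style role markers are illustrative; real templates are model-specific.

def build_prompt(messages):
    """Serialize (role, text) pairs into a single prompt string."""
    parts = []
    for role, text in messages:
        parts.append(f"<|im_start|>{role}\n{text}<|im_end|>")
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "\n".join(parts)

# Correct: assistant output goes back in under the assistant role.
good = build_prompt([
    ("user", "Summarize the log file."),
    ("assistant", "Done. Three errors found."),
    ("user", "Show me the first error."),
])

# Buggy harness: assistant output re-injected as a *user* message.
# From the model's point of view the user now appears to be reporting
# results, which is exactly the kind of mix-up that confuses it.
bad = build_prompt([
    ("user", "Summarize the log file."),
    ("user", "Done. Three errors found."),
    ("user", "Show me the first error."),
])
```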
reply
> or if the model might actually have emitted the formatting tokens that indicate a user message.

These tokens are almost universally used as stop tokens, which cause generation to stop and return control to the user.

If you didn't do this, the model would happily continue generating user + assistant pairs w/o any human input.
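A toy generation loop shows the mechanism. `next_token()` here is a stand-in for real model sampling, and the stop sequences are illustrative; actual marker names vary by model.

```python
# Toy loop showing why role markers double as stop sequences: without the
# check, the model would keep generating user + assistant turns on its own.

STOP_SEQUENCES = ["<|im_start|>user", "<|im_end|>"]  # illustrative names

def generate(next_token, max_tokens=100):
    out = ""
    for _ in range(max_tokens):
        out += next_token()
        for stop in STOP_SEQUENCES:
            if out.endswith(stop):
                # Cut off the stop sequence and hand control back to the user.
                return out[: -len(stop)]
    return out

# Fake model that answers, then would start hallucinating the next user turn.
tokens = iter(["Sure", ", done.", "<|im_end|>", "And the user", " says..."])
print(generate(lambda: next(tokens)))  # stops at <|im_end|>: "Sure, done."
```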

reply
I believe you're right; it's an issue of the model misinterpreting things that sound like user messages as actual user messages. It's a known phenomenon: https://arxiv.org/abs/2603.12277
reply
Also, it could be a bit of both, with the harness constructing context in a way that the model misinterprets.
reply
author here - yeah, maybe 'reasoning' is the incorrect term here; I just mean the dialogue that Claude generates for itself between turns before producing the output that it gives back to the user
reply
Yeah, those are usually called "reasoning" or "thinking" tokens AFAIK, so I think the terminology is correct. But from the traces I've seen, they're usually in a sort of diary style and start by repeating the last user requests and tool results. They don't introduce new requirements out of the blue.

Also, they're usually bracketed by special tokens to distinguish them from "normal" output for both the model and the harness.
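For illustration, here's a minimal sketch of how a harness might separate bracketed reasoning from the final answer. The `<think>`/`</think>` delimiters are one convention used by some open models; the actual special tokens are model-specific.

```python
import re

# Sketch: split bracketed "thinking" tokens from the visible answer.
# The <think>...</think> delimiters are illustrative, not universal.

def split_reasoning(raw):
    """Return (reasoning, answer) extracted from raw model output."""
    m = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if not m:
        return "", raw.strip()
    reasoning = m.group(1).strip()
    answer = raw[m.end():].strip()
    return reasoning, answer

raw = "<think>User asked for the first error; I'll quote it.</think>Error 1: disk full."
thinking, answer = split_reasoning(raw)
```

The key point is that the harness only shows `answer` to the user; the bracketing is what lets both sides tell the two streams apart.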

(They can get pretty weird, like in the "user said no but I think they meant yes" example from a few weeks ago. But I think that requires a few rounds of wrong conclusions and motivated reasoning before it gets to that point - not at the beginning.)

reply