You can even construct a raw prompt and tell it your own messaging structure just via the prompt. During my initial tinkering with a local model I did it this way because I didn't know about the special delimiters. It actually kind of worked and I got it to call tools. It was just more unreliable. And it also did some weird stuff like repeating the problem statement that it should act on with a tool call, and got into loops where it posed itself similar problems and then tried to fix them with tool calls. Very weird.
In any case, I think the lesson here is that it's all just probabilistic. When it works and the agent does something useful or even clever, then it feels a bit like magic. But that's misleading and dangerous.
If terraform were to abide, I'd hope at the very least it would check if in a pipeline or under an agent. This should be obvious from file descriptors/env.
What about the next thing that might make a suggestion relying on our discretion? Patch it for agent safety?
Even a first party suggestion can be wrong in context, and if a malicious actor managed to substitute that message with a suggestion of their own, humans would fall for the trick even more than LLMs do.
See also: phishing.
Discretion, etc. We understand that was the tool making a suggestion, not our idea. Our agency isn't in question.
The removal proposal is similar to wanting a phishing-free environment instead of preparing for the inevitability. I could see removing this message based on your point of context/utility, but not to protect the agent. We get no such protection, just training and practice.
A supply chain attack is another matter entirely; I'm sure people would pause at a new suggestion that deviates from their plan/training. As shown, autobots are eager to roll out and easily drown in context. So much so that `User` and `stdout` get confused.
Claude in testing would interrupt too much to ask clarifying questions. So as a heavy-handed fix they turn down the sampling probability of the <end of turn> token, which hands back to the user for clarifications.
So it doesn't hand back to the user, but the internal layers expected an end of turn, so you get this weird sort of self-answering behaviour as a result.
As an aside, my big reason for believing this is that this sort of dumb, simple patch laid on top of an existing behaviour is often the kind of solution optimizers find. Like if you made a dataset with lots of pairs where one side has lots of <end of turn>s and one side does not. The harder thing to learn tends to be to "ask fewer questions and work more autonomously", while the easy thing to learn, "fewer end of turn tokens", tends to get learned way faster.
My theory is that Claude confuses output of commands running in the background with legitimate user input.