You can even construct a raw prompt and tell it your own messaging structure just via the prompt. During my initial tinkering with a local model I did it this way because I didn't know about the special delimiters. It actually kind of worked and I got it to call tools; it was just more unreliable. It also did some weird stuff, like repeating the problem statement it was supposed to act on alongside a tool call, and it got into loops where it posed itself similar problems and then tried to fix them with tool calls. Very weird.
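A sketch of what I mean by inventing your own messaging structure (the `### role:` markers and the TOOL convention here are made up, not any model's real special tokens, which is exactly why this sort of thing parses unreliably):

```python
# Hypothetical example: ad-hoc message structure in plain text instead of
# the model's trained chat/special delimiter tokens.
def build_raw_prompt(system, history):
    # These "### role:" markers are just an invented convention; the model
    # was never trained on them, so tool calls come out unreliably.
    parts = [f"### system:\n{system}"]
    for role, text in history:
        parts.append(f"### {role}:\n{text}")
    parts.append("### assistant:\n")  # leave the assistant turn open
    return "\n\n".join(parts)

prompt = build_raw_prompt(
    "You can call tools by writing TOOL(name, args) on its own line.",
    [("user", "List the files in /tmp.")],
)
```

With a model's actual chat template the same structure would be encoded with its trained special tokens, which is what keeps it from drifting into self-posed problems.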
In any case, I think the lesson here is that it's all just probabilistic. When it works and the agent does something useful or even clever, then it feels a bit like magic. But that's misleading and dangerous.
If terraform were to abide, I'd hope it would at the very least check whether it's running in a pipeline or under an agent. This should be obvious from file descriptors/env.
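A rough sketch of that kind of detection, assuming the usual conventions (`isatty` on stdout plus a few common CI environment variables; the variable list is illustrative, not exhaustive, and this isn't anything terraform actually does):

```python
import os
import sys

def likely_non_interactive() -> bool:
    # If stdout isn't a TTY, output is being piped or captured --
    # a pipeline, or a subprocess spawned by an agent.
    if not sys.stdout.isatty():
        return True
    # Many CI systems set CI=true; others set their own markers.
    for var in ("CI", "GITHUB_ACTIONS", "GITLAB_CI"):
        if os.environ.get(var):
            return True
    return False

if likely_non_interactive():
    # Suppress interactive "run this command to fix it" suggestions here.
    pass
```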
What about the next thing that might make a suggestion relying on our discretion? Patch it for agent safety?
Even a first party suggestion can be wrong in context, and if a malicious actor managed to substitute that message with a suggestion of their own, humans would fall for the trick even more than LLMs do.
See also: phishing.
Discretion, etc. We understand that it was the tool making a suggestion, not our own idea. Our agency isn't in question.
The removal proposal is similar to wanting a phishing-free environment instead of preparing for the inevitability. I could see removing this message based on your point of context/utility, but not to protect the agent. We get no such protection, just training and practice.
A supply chain attack is another matter entirely; I'm sure people would pause at a new suggestion that deviates from their plan/training. As shown, autobots are eager to roll out and easily drown in context. So much so that `User` and `stdout` get confused.
Claude in testing would interrupt too much to ask clarifying questions. So, as a heavy-handed fix, they turn down the sampling probability of the <end of turn> token, which is what hands control back to the user for clarifications.
So it doesn't hand back to the user, but the internal layers expected an end of turn, so you get this weird sort of self-answering behaviour as a result.
As an aside, my big reason for believing this is that this sort of dumb, simple patch laid on top of an existing behaviour is often the kind of solution optimizers find. Like if you made a dataset with lots of pairs where one side has lots of <end of turn> tokens and one side does not: the harder thing to learn tends to be "ask fewer questions and work more autonomously", while the easy thing, "emit fewer end of turn tokens", tends to get learned way faster.
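For what it's worth, the cheap version of that patch is just a logit bias at sampling time. This is a generic sketch of the technique, not Anthropic's actual mechanism; the token id and bias value are made up:

```python
import math
import random

def sample_with_bias(logits, bias=None):
    # Add a per-token-id bias to the logits, then softmax-sample.
    # A negative bias on an <end_of_turn> id makes handing the turn
    # back less likely without retraining anything.
    bias = bias or {}
    adjusted = [l + bias.get(i, 0.0) for i, l in enumerate(logits)]
    m = max(adjusted)  # subtract max for numerical stability
    exps = [math.exp(a - m) for a in adjusted]
    total = sum(exps)
    r = random.random()
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r <= acc:
            return i
    return len(exps) - 1

END_OF_TURN = 2  # hypothetical token id
# Push end-of-turn down by several logits so the model keeps going.
next_token = sample_with_bias([1.0, 0.5, 3.0], {END_OF_TURN: -5.0})
```

The point is how little this changes: one additive term per step, which is exactly the kind of shallow fix an optimizer would also converge on through training data.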
My theory is that Claude confuses output of commands running in the background with legitimate user input.