undefined

points

[-]

> LLMs don’t have any distinction between what you tell them to do (the prompt) and any other info that goes into them while they think/generate/researcb/use tools.

This is false as you can specify the role of the message FWIW.

by simonw4 hours ago|

parent|

[-]

Specifying the message role should be considered a suggestion, not a hardened rule.

I've not seen a single example of an LLM that can reliably follow its system prompt against all forms of potential trickery in the non-system prompt.

Solve that and you've pretty much solved prompt injection!

by koakuma-chan4 hours ago|

parent|

[-]

> The lack of a 100% guarantee is entirely the problem.

I agree, and I agree that when using models there should always be the assumption that the model can use its tools in arbitrary ways.

> Solve that and you've pretty much solved prompt injection!

But do you think this can be solved at all? For an attacker who can send arbitrary inputs to a model, getting the model to produce the desired output (e.g. a malicious tool call) is a matter of finding the correct input.

edit: how about limiting the rate at which inputs can be tried and/or using LLM-as-a-judge to assess legitimacy of important tool calls? Also, you can probably harden the model by finetuning to reject malicious prompts; model developers probably already do that.

by simonw3 hours ago|

parent|

[-]

I continue to hope that it can be solved but, after three years, I'm beginning to lose faith that a total solution will ever be found.

I'm not a fan of the many attempted solutions that try to detect malicious prompts using LLMs or further models: they feel doomed to failure to me, because hardening the model is not sufficient in the face of adversarial attackers who will keep on trying until they find an attack that works.

The best proper solution I've seen so far is still the CaMeL paper from DeepMind: https://simonwillison.net/2025/Apr/11/camel/

by jonplackett4 hours ago|

parent|

prev|

[-]

It doesn’t make much difference. Not enough anyway.

In the end all that stuff just becomes context

Read some more of you want https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/

by koakuma-chan4 hours ago|

parent|

[-]

It does make a difference and does not become just context.

See https://cookbook.openai.com/articles/openai-harmony

There is no guarantee that will work 100% of the time, but effectively there is a distinction, and I'm sure model developers will keep improving that.

by simonw4 hours ago|

parent|

[-]

The lack of a 100% guarantee is entirely the problem.

If you get to 99% that's still a security hole, because an adversarial attacker's entire job is to keep on working at it until they find the 1% attack that slips through.

Imagine if SQL injection of XSS protection failed for 1% or cases.

by cruffle_duffle1 hours ago|

parent|

prev|

[-]

Correct me if I’m wrong but in general that is just some json window dressing that gets serialized into plaintext and then into tokens…. There is nothing special about the roles and stuff… at least I think. Maybe they become “magic tokens” or “special tokens” but even then they aren’t hard fast rules.

by koakuma-chan16 minutes ago|

parent|

[-]

They are special because models are trained to prioritize messages with role system over messages with role user.