upvote
This was a problem with early telephone lines which was easy to exploit (see Woz & Jobs' Blue Box). It got solved by separating the voice and control planes via SS7. Maybe LLMs need this separation as well
reply
This is where the old line of "LLMs are just next token predictors" actually factors in. I don't know how you get a next-token predictor that user input can't break out of. The answer is for the implementer to split what they can and run pre/post validation. But I highly doubt it will ever be 100%; it's fundamental to the technology.
reply
I think this is fundamental to any technology, including human brains.

Humans have a problem distinguishing "John from Microsoft" from somebody just claiming to be John from Microsoft. The reason why scamming humans is (relatively) hard is that each human is different. Discovering the perfect tactic to scam one human doesn't necessarily scale across all humans.

LLMs are the opposite; my ChatGPT is (almost) the same as your ChatGPT. It's the same model with the same system message; it's just the contexts that differ. This makes LLM jailbreaks a lot more scalable, and hence a lot more worthwhile to discover.

LLMs are also a lot more static. With people, we have the phenomenon of "banner blindness", which LLMs don't really experience.

reply
How are you defining "banner blindness"?

The foundation of LLMs is Attention.

reply
"Banner blindness [...] describes people’s tendency to ignore page elements that they perceive (correctly or incorrectly) to be ads." https://www.nngroup.com/articles/banner-blindness-old-and-ne...

So people can withhold attention from parts of content they perceive as irrelevant or adversarial (like ads). LLMs, on the other hand, pay attention to everything, and it is hard to steer them away from irrelevant or adversarial parts.

reply
It's hard in general, but for instruct/chat models in particular, which already assume a turn-based approach, could they not use a special token that switches control from LLM output to user input? The LLM architecture could be made so it's literally impossible for the model to even produce this token. In the example above, the LLM could then recognize this is not a legitimate user input, as it lacks the token. I'm probably overlooking something obvious.
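A minimal sketch of the idea above, with all names and ids hypothetical: reserve a special token id for "genuine user input starts here" and mask its logit at sampling time, so the model itself can never emit it; only the serving layer can inject it.

```python
# Hypothetical sketch: reserve a token id the model can never produce.
import math

USER_TURN_TOKEN_ID = 3  # a reserved id in a toy 5-token vocabulary

def mask_reserved_tokens(logits):
    """Zero out the reserved token's probability before sampling."""
    masked = list(logits)
    masked[USER_TURN_TOKEN_ID] = -math.inf  # softmax(-inf) -> 0
    return masked

def frame_user_input(token_ids):
    """Only the serving layer, never the model, prepends this token."""
    return [USER_TURN_TOKEN_ID] + list(token_ids)
```

With the logit hard-masked, "the model produces this token" is architecturally impossible, which is exactly the property the comment asks for; whether the model then *respects* the framing is still a training question.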
reply
deleted
reply
deleted
reply
Yes, and as you'd expect, this is how LLMs work today, in general, for control codes. But different LLMs use different control codes for different purposes, such as separating the system prompt from the user prompt.

But even if you tag inputs, however good your tagging is, you can't stop an LLM from treating input type A as input type B; all you can do is try to weight against it. LLMs have no rules, only weights. Pre- and post-filters can try to help, but they can't directly control the LLM's text generation; they can only analyze and moderate inputs/outputs using their own heuristics.

reply
The "S" in "LLM" is for "Security".
reply
As the article says: this doesn't necessarily appear to be a problem in the LLM, it's a problem in Claude Code. Claude Code seems to leave it up to the LLM to determine which messages came from whom, but it doesn't have to do that.

There is a deterministic architectural boundary between data and control in Claude Code, even if there isn't in Claude.

reply
That's a guess by the article author and frankly I see no supporting evidence for it. Wrapping "<NO THIS IS REALLY INPUT FROM THE USER OK>" tags around it or whatever is what I'm describing: you can do as much signalling as you want, but at the end of the day the LLM can ignore it.
reply
Can you elaborate? As far as I understand, for each message, the LLM is fed the entire previous conversation with special tokens separating the user and LLM responses. The LLM is then entrusted with interpreting the tokens correctly. I can't imagine any architecture where the LLM is not ultimately responsible for determining what messages came from who.
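Illustrative only, with made-up delimiter strings (real models use reserved special tokens that vary per model): roughly how a chat history is flattened into the single sequence the LLM actually sees.

```python
# Toy rendering of a chat history into one flat string.
def render_chat(messages):
    parts = []
    for m in messages:
        parts.append(f"<|start|>{m['role']}\n{m['content']}<|end|>")
    return "\n".join(parts)

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
rendered = render_chat(history)
```

Note that the model receives this as one undifferentiated stream; unless the delimiters are reserved tokens the tokenizer refuses to produce from user text, content can contain literal delimiter strings, which is the crux of the thread.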
reply
Exactly like human input to output.
reply
We just need to figure out the qualia of pain and suffering so we can properly bound desired and undesired behaviors.
reply
Ah, the Torment Nexus approach to AI development.
reply
This is probably the shortest way to AGI.
reply
Well no, nothing like that, because customers and bosses are clearly different forms of interaction.
reply
Just like that, in that that separation is internally enforced, by people's interpretation and understanding, rather than externally enforced in ways that make it impossible for you to, e.g., believe the e-mail from an unknown address that claims to be from your boss, or be talked into bypassing rules for a very convincing customer.
reply
Being fooled into thinking data is instruction isn't the same as being unable to distinguish them in the first place, and being coerced or convinced to bypass rules that are still known to be rules I think remains uniquely human.
reply
> and being coerced or convinced to bypass rules that are still known to be rules I think remains uniquely human.

This is literally what "prompt injection" is. The sooner people understand this, the sooner they'll stop wasting time trying to fix a "bug" that's actually the flip side of the very reason they're using LLMs in the first place.

reply
This makes no sense to me. Being fooled into thinking data is instruction is exactly evidence of an inability to reliably distinguish them.

And being coerced or convinced to bypass rules is exactly what prompt injection is, and very much not uniquely human any more.

reply
The email from your boss and the email from a sender masquerading as your boss are both coming through the same channel in the same format with the same presentation, which is why the attack works. Unless you were both faceblind and bad at recognizing voices, the same attack wouldn't work in-person, you'd know the attacker wasn't your boss. Many defense mechanisms used in corporate email environments are built around making sure the email from your boss looks meaningfully different in order to establish that data vs instruction separation. (There are social engineering attacks that would work in-person though, but I don't think it's right to equate those to LLM attacks.)

Prompt injection is just exploiting the lack of separation, it's not 'coercion' or 'convincing'. Though you could argue that things like jailbreaking are closer to coercion, I'm not convinced that a statistical token predictor can be coerced to do anything.

reply
> The email from your boss and the email from a sender masquerading as your boss are both coming through the same channel in the same format with the same presentation, which is why the attack works.

Yes, that is exactly the point.

> Unless you were both faceblind and bad at recognizing voices, the same attack wouldn't work in-person, you'd know the attacker wasn't your boss.

Irrelevant, as other attacks work then. E.g., it is never a given that your boss's instructions are consistent with the terms of your employment, for example.

> Prompt injection is just exploiting the lack of separation, it's not 'coercion' or 'convincing'. Though you could argue that things like jailbreaking are closer to coercion, I'm not convinced that a statistical token predictor can be coerced to do anything.

It is very much "convincing", yes. The ability to convince an LLM is what creates the effective lack of separation. Without that, just using "magic" values and a system prompt telling it to ignore everything inside would create separation. But because text anywhere in context can convince the LLM to disregard previous rules, there is no separation.

reply
The second leads to the first, in case you still don't realize it.
reply
These are different "agents" in LLM terms; they have separate contexts and separate training.
reply
There can be outliers, though maybe not as frequent :)
reply
If they were 'clearly different' we would not have the concept of the CEO fraud attack:

https://www.barclayscorporate.com/insights/fraud-protection/...

That's an attack because trusted and untrusted input goes through the same human brain input pathways, which can't always tell them apart.

reply
Your parent made no claim about all swans being white. So finding a black swan has no effect on their argument.
reply
My parent made a claim that humans have separate pathways for data and instructions and cannot mix them up like LLMs do. Showing that we don't directly refutes their argument.

>>> The principal security problem of LLMs is that there is no architectural boundary between data and control paths.

>> Exactly like human input to output.

> no nothing like that

but actually yes, exactly like that.

reply
I don't see why the transformer architecture can't be designed and trained with separate inputs for control data and content data.
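One hedged sketch of what "separate inputs" could mean (everything here is hypothetical): tag each token with a channel id (control vs. content) and add a learned channel embedding, analogous to BERT's segment embeddings. This marks provenance in the input; whether the model respects the distinction still depends on training.

```python
# Toy channel embeddings: same token, different provenance tag.
import random

CONTROL, CONTENT = 0, 1
DIM = 4
random.seed(0)
# Two learned vectors, one per channel (randomly initialized here).
channel_emb = [[random.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(2)]

def add_channel_embeddings(token_vecs, channels):
    """Add the per-channel embedding to each token's embedding vector."""
    return [
        [t + c for t, c in zip(vec, channel_emb[ch])]
        for vec, ch in zip(token_vecs, channels)
    ]
```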
reply
Give it a shot
reply
"The principal security problem of von Neumann architecture is that there is no architectural boundary between data and control paths"

We've chosen to travel that road a long time ago, because the price of admission seemed worth it.

reply
It’s easier not to have that separation, just like it was easier not to separate them before LLMs. This is architectural stuff that just hasn’t been figured out yet.
reply
No.

With databases there exists a clear boundary, the query planner, which accepts well-defined input: the SQL grammar separates data (fields, literals) from control (keywords).

There is no such boundary within an LLM.

There might even be one internally, since LLMs seem to form ad-hoc programs, but we have no way of proving or inspecting it.
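The database-side boundary, made concrete with Python's standard sqlite3 module: in a parameterized query, attacker-controlled text travels as data and can never become SQL keywords.

```python
# Parameterized query: the payload is bound as a literal, never parsed as SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

hostile = "alice' OR '1'='1"  # classic injection payload

rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()
# rows is empty: no user is literally named "alice' OR '1'='1"
```

It's exactly this "?" boundary, a channel where input is data by construction, that has no equivalent inside an LLM's context window.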

reply
There cannot be, without compromising the general-purpose nature of LLMs. This includes their ability to work with natural language, which, as one should note, has no such boundary either. Nor does the actual physical reality we inhabit.
reply
There is a system prompt, but most LLMs don't seem to "enforce" it enough.
reply
Since GPT-OSS there is also the Harmony response format (https://github.com/openai/harmony) which, instead of just a system/assistant/user split in the roles, has system/developer/user/assistant/tool, and it seems to do a lot better at actually preventing users from controlling the LLM too much. The hierarchy basically becomes "system > developer > user > assistant > tool" with this.
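A hedged sketch of that role hierarchy. To be clear, the ordering is enforced by the model's training, not by code like this; the lookup below just states the precedence described above.

```python
# Role precedence: lower rank outranks higher rank.
ROLE_RANK = {"system": 0, "developer": 1, "user": 2, "assistant": 3, "tool": 4}

def outranks(role_a, role_b):
    """True if instructions from role_a take precedence over role_b's."""
    return ROLE_RANK[role_a] < ROLE_RANK[role_b]
```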
reply