undefined

points

[-]

It looks as if this tool has traditional static rules to allow/deny requests, as well as a secondary LLM-as-a-judge layer for, I imagine, the kinds of rules that would be messy or too convoluted to implement using standard rules.

by stingraycharles11 hours ago|

parent|

[-]

I think the parent’s point is that this should be implemented using e.g. Bayesian statistics rather than an LLM, as the judge LLM is vulnerable to the exact same types of attacks that it’s trying to protect against.

Most proper LLM guardrails products use both.

by snug14 hours ago|

prev|

[-]

I think this can be great as additional layer of security. Where you can have a non llm layer do some analysis with some static rules and then if something might seem phishy run it through the llm judge so that you don’t have to run every request through it, which would be very expensive.

Edit: actually looks like it has two policy engines embedded

by windexh8er14 hours ago|

parent|

[-]

And we don't think the judge can/will be gamed? Also... It's an LLM, it's going to add delay and additional token burn. One subjective black box protecting another subjective black box. I mean, what couldn't go wrong?

by lukewarm7073 hours ago|

parent|

[-]

you can use a safety model trained on prompt injections with developer message priority.

user message becomes close to untrusted compared to dev prompt.

also post train it only outputs things like safe/unsafe so you are relatively deterministic on injection or no injection.

ie llama prompt guard, oss 120 safeguard.

by ImPostingOnHN14 hours ago|

parent|

prev|

[-]

What happens when a prompt injection attack exploits the judge LLM and results in a higher level of attacker control than if it never existed?

by vova_hn213 hours ago|

parent|

[-]

How can it result in a higher level of control? I don't see why the "judge" should have access to anything except one tool that allows it to send an "accept" or "deny" command.

by nl13 hours ago|

prev|

[-]

> We’re supposed to be fixing LLM security by adding a non-LLM layer to it,

If people said "we build a ML-based classifier into our proxy to block dangerous requests" would it be better? Why does the fact the classifier is a LLM make it somehow worse?

by Retr0id13 hours ago|

parent|

[-]

The fact that LLMs are "smarter" is also their weakness. An oldschool classifier is far from foolproof, but you won't get past it by telling it about your grandma's bedtime story routine.

by reassess_blind11 hours ago|

parent|

[-]

Fairly hard to bypass the latest LLMs with grandma's bedtime story these days, to be fair.

by Retr0id11 hours ago|

parent|

[-]

That specific trick yes, but the general concept still applies.

by reassess_blind11 hours ago|

parent|

[-]

It does, but it's certainly not trivial. In fact there's an unclaimed $1000 bounty on prompt injecting OpenClaw: https://hackmyclaw.com/

by DANmode9 hours ago|

parent|

[-]

Is that enough?

by reassess_blind3 hours ago|

parent|

[-]

Enough for what?

by waterTanuki13 hours ago|

parent|

prev|

[-]

If you're working in a mission-critical field like healthcare, defense, etc. you need a way to make static and verifiable guarantees that you can't leak patient data, fighter jet details etc. through your software. This is either mandated by law or in your contract details.

The entire purpose of LLMs is to be non-static: they have no deterministic output and can't be validated the same way a non-LLM function can be. Adding another LLM layer is just adding another layer of swiss cheese and praying the holes don't line up. You have no way of predicting ahead of time whether or not they will.

You might say this hasn't prevented leaks/CVEs in exisiting mission-critical software and this would be correct. However, the people writing the checks do not care. You get paid as long as you follow the spec provided. How then, in a world which demands rigorous proof do you fit in an LLM judge?

by nl10 hours ago|

parent|

[-]

> The entire purpose of LLMs is to be non-static: they have no deterministic output and can't be validated the same way a non-LLM function can be. Adding another LLM layer is just adding another layer of swiss cheese and praying the holes don't line up. You have no way of predicting ahead of time whether or not they will.

This is exactly the point though. A LLM is great at finding work-around for static defenses. We need something that understands the intent and responds to that.

Static rules are insufficient

by SkyPuncher14 hours ago|

prev|

[-]

Defense in depth. Layers don't inherently make something less secure. Often, they make it more secure.

by yakkomajuri14 hours ago|

parent|

[-]

I do think this is likely to make things more secure but it's also dangerous by potentially giving users a false sense of complete security when the security layer is probabilistic rather than deterministic.

EDIT: it does seem to have a deterministic layer too and I think that's great