undefined

points

[-]

I think this is fundamental to any technology, including human brains.

Humans have a problem distinguishing "John from Microsoft" from somebody just claiming to be John from Microsoft. The reason why scamming humans is (relatively) hard is that each human is different. Discovering the perfect tactic to scam one human doesn't necessarily scale across all humans.

LLMs are the opposite; my Chat GPT is (almost) the same as your Chat GPT. It's the same model with the same system message, it's just the contexts that differ. This makes LLM jailbreaks a lot more scalable, and hence a lot more worthwhile to discover.

LLMs are also a lot more static. With people, we have the phenomenon of "banner blindness", which LLMs don't really experience.

by lupire5 hours ago|

parent|

[-]

How are you defining "banner blindness"?

The foundation of LLMs is Attention.

by warkdarrior4 hours ago|

parent|

[-]

"Banner blindness [...] describes people’s tendency to ignore page elements that they perceive (correctly or incorrectly) to be ads." https://www.nngroup.com/articles/banner-blindness-old-and-ne...

So people can focus their attention to parts of content, specifically parts they find irrelevant or adversarial (like ads). LLMs on the other hand pay attention to everything or if they focus on something, it is hard to steer them away from irrelevant or adversarial parts.

by salt40345 hours ago|

prev|

[-]

It's hard in general, but for instruct/chat models in particular, which already assume a turn-based approach, could they not use a special token that switches control from LLM output to user input? The LLM architecture could be made so it's literally impossible for the model to even produce this token. In the example above, the LLM could then recognize this is not a legitimate user input, as it lacks the token. I'm probably overlooking something obvious.

by 3 hours ago|

parent|

[-]

deleted

by 4 hours ago|

parent|

prev|

[-]

deleted

by lupire4 hours ago|

parent|

prev|

[-]

Yes, and as you'd expect, this is how LLMs work today, in general, for control codes. But different elems use different control codes for different purposes, such as separating system prompt from user prompt.

But even if you tag inputs however your this is good, you can't force an LLM to it treat input type A as input type B, all you can do is try to weight against it! LLMs have no rules, only weights. Pre and post filters cam try to help, but they can't directly control the LLM text generation, they can only analyze and most inputs/output using their own heuristics.