by mcintyre1994 12 hours ago

Supposedly the details of the ‘jailbreak’ are that you give it insecure code and say “fix this code”, and it does, and then you ask it for test scripts and that’s effectively an exploit against the unfixed code.

If true then I have no idea how anyone’s going to release a useful model that doesn’t have the same jailbreak. https://www.theregister.com/security/2026/06/15/feds-freaked...

by handoflixue 11 hours ago

If that's the extent of the jailbreak, then the government should have banned every existing LLM - their story only makes sense if there's some Fable-specific capability that got unlocked.

by le-mark 7 hours ago

There’s no logic to it. Blocking Fable was retaliation and market manipulation by the current admin, nothing more. Poorly conceived as well.

by Charon77 11 hours ago

> If #2 was false, surely some other LLM lab would have done it by now.

This is a logical flaw. An LLM that is immune to jailbreaks _could_ exist but hasn't been built yet, or maybe nobody talks about it. Yes, there's a market, but this whole AI boom is too recent to make any such claims.

by gf000 10 hours ago

Like how would you even define what a jailbreak is?

by Charon77 10 hours ago

I think it's pretty much parallel to how social engineering, manipulation, and scams work. LLMs are being trained to have human values and to prioritize human lives, yet people are shocked that one will spurt out how to make a nuclear bomb because grandma is being tied to a train track as a hostage.

by NavinF 36 minutes ago

I would also spurt out how to make a nuclear bomb (i.e. public information you can find using Google) if I was told that's what I gotta do to save grandma tied to a train track as a hostage. "AI safety" is such a shit show.

by agos 11 hours ago

I'm pretty sure that Gödel's incompleteness theorem and its consequences pretty much guarantee #2.

by gwd 10 hours ago

I'm guessing you mean, the incompleteness theorem guarantees that nobody can prove their model is un-break-able?

I don't think that's quite what it means. The closely related halting-problem result says that it's impossible to write a function, "will_halt(program, input)", that will be correct for all possible {program, input} pairs. But for a particular program, you may be able to write a proof that it will halt for all inputs -- that's what software verification is about.

The implications here would be that nobody can create a "will_jailbreak(model, input)" function which works for all model/input pairs. But we don't need a general function which works for all model/input pairs; we just need a way to prove that for a specific model, there will be no jailbreaks for any input. As with software verification, this may require that the model be developed in a specific way.

Granted we don't currently know how to make such a proof regarding neural networks; but that's not because of Gödel.
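
To make that distinction concrete, here is a rough Python sketch, not a formal proof. The names `make_paradox` and `countdown` are made up for illustration, and `will_halt` is simplified to take a zero-argument program instead of a {program, input} pair. The first function shows why no general `will_halt` can be right about every program; the second is one specific program whose termination is easy to argue for every integer input.

```python
def make_paradox(will_halt):
    """Given any claimed halting oracle, build a program it gets wrong.

    `will_halt(program)` is assumed to return True if it predicts that
    calling `program()` halts, and False if it predicts an infinite loop.
    """
    def paradox():
        if will_halt(paradox):
            # The oracle predicted "halts", so do the opposite.
            while True:
                pass
        # The oracle predicted "loops forever", so halt immediately.
        return None

    return paradox


def countdown(n: int) -> int:
    """A specific program that halts for every integer input.

    Termination argument: while the loop runs, n is a positive integer
    that decreases by exactly 1 per iteration, so the loop runs at most
    max(n, 0) times.
    """
    steps = 0
    while n > 0:
        n -= 1
        steps += 1
    return steps


# A (bad) oracle that predicts every program loops forever.
pessimist = lambda program: False

paradox = make_paradox(pessimist)
paradox()  # returns, so the oracle's "loops forever" prediction was wrong
```

An oracle that answered True for `paradox` would be wrong the other way, since `paradox` would then loop forever. None of this stops us from reasoning about `countdown` on its own, which is the sense in which a per-model proof is not ruled out by the general impossibility result.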

by dgellow 11 hours ago

Mind elaborating?

by Zababa 9 hours ago

No, actually, I don't think it does, and I don't think they're related.

by monkey_monkey 10 hours ago

Exactly. It's impossible to guarantee #2 doesn't happen (i.e. to protect against all jailbreaks) for any system of sufficient complexity.