I retract that.

I think what I meant to say was that they're as simple to jailbreak as they were three years ago.

Different methods, still simple. I'm working with researchers who are able to get very explicit things out of them. If anything, it feels much worse than before, given how capable these models are now.

There are basically guardrails encoded into the fine-tuned layers, and you can essentially weave through them with prompting. These 'guardrails' are where they work hard for benevolent alignment, yet that's also where it falls short (even as it enables exceptional capability alignment). Again, nothing really different from how it was three years ago.
