upvote
This very question was asked to Nicholas Carlini from Anthropic at this talk: https://www.youtube.com/watch?v=1sd26pWhfmg

The answer is complex, worth watching the video. But mainly, they don't know where to place the line. Defenders need tools, as good as attackers. Attackers will jailbreak models, defender might not, it's the safeguard positive in that case? Carlini actively asks the audience and community for "help" in determining how to proceed basically.

reply
[dead]
reply