upvote
Its weird having protections against finding exploits: what if I developed the app? Would it require having the development steps still in the context.. thats unlikely and also not any kind of proof.

What if I intersperse exploit finding in my normal development, as you `probably should? Refusing there would be really weird to me.

reply
I used to think that the models would not refuse to find exploits in any work done locally but I have only tested this theory on the (obscure) apps that I have built on my machine. Now if i forked pandas and started asking models to find exploits of certain kind then I'd like to think the models will start refusing after a point.
reply
I think the most interesting thing revealed here is that anthropic's guardrails failed. Clearly anthropic does not want claude to be able develop exploits, yet 20% of the time it did anyway. Their inability to create effective an guardrail makes me question a lot of the other guardrails theyve created and their claims about non harm.
reply