upvote
I continue to hope that it can be solved but, after three years, I'm beginning to lose faith that a total solution will ever be found.

I'm not a fan of the many attempted solutions that try to detect malicious prompts using LLMs or further models: they feel doomed to failure to me, because hardening the model is not sufficient in the face of adversarial attackers who will keep on trying until they find an attack that works.

The best proper solution I've seen so far is still the CaMeL paper from DeepMind: https://simonwillison.net/2025/Apr/11/camel/

reply