I'm not a fan of the many attempted solutions that try to detect malicious prompts using LLMs or further models: they feel doomed to failure to me, because hardening the model is not sufficient in the face of adversarial attackers who will keep on trying until they find an attack that works.
The best proper solution I've seen so far is still the CaMeL paper from DeepMind: https://simonwillison.net/2025/Apr/11/camel/