I'm personally somewhat surprised that things like system prompts get through, as that's literally a known string, not a vague "such and such are taboo concepts". I also don't see much harm in it, but given _that_ you want to block it, do you really need a whole other network for that?
FWIW by "input" I was referring to what the other commenter mentioned: it's almost certainly explicitly present in the training set. Maybe that's why "leetspeak" works -- because that's how the original authors got it past the filters of reddit, forums, etc?
If the model can really work out how to make a bomb from first principles, then they're way more capable than I thought. And, come to think of it, probably also clever enough to encode the message so that it gets through...