undefined

points

[-]

you can use a safety model trained on prompt injections with developer message priority.

user message becomes close to untrusted compared to dev prompt.

also post train it only outputs things like safe/unsafe so you are relatively deterministic on injection or no injection.

ie llama prompt guard, oss 120 safeguard.