undefined

points

[-]

This is something inherently hard to avoid with a prompt. The model is instruction-tuned and trained to interpret anything sent under the user role as an instruction, not necessarily in a straightforward manner. Even if you train it to refuse or dodge some inputs (which they do), it's going to affect model's response, often in subtle ways, especially in a multiturn convo. Anthropic themselves call this the character drift.