These are contradictory cases. If you put guardrails into the system prompt, you've anticipated that the AI will take the action you're guardrailing against. And since AI prompt compliance is at best stochastic (and realistically just crap, over large sample sizes), every guardrail is an explicit recognition of a failure -- the guardrail will be ignored, and you can't pretend you didn't realize it was a problem, since you put it in.
The best comparison I can think of is that it's like validating data on the frontend; it can make for a better user experience and be more efficient than hitting the backend when you know it will be an error, but it's not protection in any meaningful sense, and if you're not also enforcing invariants from behind the API, you're going to have a bad time. The failure mode here is pretty similar to what you run into with an implementation like that: someone makes a request with data your frontend would never send and performs operations you didn't mean to allow.
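To make the frontend/backend comparison concrete, here is a minimal sketch in Python. Everything in it is hypothetical and made up for illustration: the `Account` type, the `frontend_validate` and `backend_change_email` functions, and the rule that an email change must carry a server-issued confirmation token. It is not how any particular service works, just one way the "enforce it behind the API" idea could look.

```python
from dataclasses import dataclass


@dataclass
class Account:
    account_id: str
    email: str


class InvariantError(Exception):
    """Raised when a request would break a server-side rule."""


def frontend_validate(form: dict) -> bool:
    # Convenience only: catches obvious mistakes before a round trip.
    # Anyone can skip this by calling the API directly.
    return "@" in form.get("new_email", "")


def backend_change_email(account: Account, request: dict, issued_tokens: dict) -> Account:
    # Enforced behind the API, whatever the client claims about itself.
    new_email = request.get("new_email", "")
    if "@" not in new_email:
        raise InvariantError("malformed email")
    # Hypothetical rule: the change must carry a token the server itself
    # issued for this account (e.g. one sent to the current address).
    expected = issued_tokens.get(account.account_id)
    if expected is None or request.get("token") != expected:
        raise InvariantError("missing or invalid confirmation token")
    return Account(account.account_id, new_email)
```

A request that passes `frontend_validate` but has no valid token still gets rejected by `backend_change_email`. A guardrail in a system prompt plays the role of the first function; it cannot stand in for the second.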
Having the guardrails in there might even be harmful if the user can obtain the system prompt and treat each advisory as a pointer to a potential weakness.
This looks like a terrible design rather than an AI problem to me, though.
An AI-enabled terrible design. The AI acted as a black box of stupidity that obscured the stupidity of the design.
Humans do get fooled, but it usually takes far more effort than that, because a human service rep can learn and is worried about having a job tomorrow.
Do we actually know that a human was in the loop before and that the human judgement was replaced by an LLM? Or is that pure speculation?
I have certainly seen account reclamation flows that allowed providing a new email address (but usually with better safeguards).
https://www.meta.com/account-recovery-support/ai-support-ass...
Now, it’s possible that they instead moved it to human workers and simultaneously forgot everything they’d learned about security or training, but that seems unlikely.
I can think of several pre-2000s chat rooms that did EXACTLY this. It is how I lost several chat accounts as a teenager.