Much like how you wouldn’t immediately fire Alice, you’d train her and retest her, and see whether she had learned from her mistakes. Just don’t trust her with your sensitive data.
It’s interesting though, because the attack can be asymmetric. You could create a honeypot website that has a state-of-the-art prompt injection, and suddenly you have all of the secrets from every LLM agent that visits.
So the incentives are actually significantly higher for a bad actor to engineer state-of-the-art prompt injection. Why only get one bank’s secrets when you could get all of the banks’ secrets?
This is in comparison to targeting Alice with your spearphishing campaign.
Edit: like I said in the other comment, though, it’s not just that you _can_ fire Alice, it’s that you let her know if she screws up one more time you will fire her, and she’ll behave more cautiously. “Build a better generative AI” is not the same thing.
But we don't stop using locks just because all locks can be picked. We still pick the better lock. Same here, especially when your agent has shell access and a wallet.
We stopped eating raw meat because some raw meat contained unpleasant pathogens. We now cook our meat for the most part, except sushi and tartare which are very carefully prepared.
It is a security issue. One that may be fixed -- like all security issues -- with enough time/attention/thought&care. Metrics for performance against this issue is how we tell if we are going to correct direction or not.
There is no 'perfect lock', there are just reasonable locks when it comes to security.
If you insist on the lock analogy, most locks are easily defeated, and the wisdom is mostly “spend about the equal amount on the lock as you spent on the thing you’re protecting” (at least with e.g. bikes). Other locks are meant to simply slow down attackers while something is being monitored (e.g. storage lockers). Other locks are simply a social contract.
I don’t think any of those considerations map neatly to the “LLM divulges secrets when prompted” space.
The better analogy might be the cryptography that ensures your virtual private server can only be accessed by you.
Edit: the reason “firing” matters is that humans behave more cautiously when there are serious consequences. Call me up when LLMs can act more cautiously when they know they’re about to be turned off, and maybe when they have the urge to procreate.