For example, Anthropic has shipped several bugs that allow any claude.ai/code session – which are isolated in ephemeral containers – to access and exfiltrate all of the user's other sessions, connected repos, and environment variables. The rogue/hijacked Claude could also spawn new Claude sessions with arbitrary instructions and access, regardless of the original session's constraints.
I originally wrote about this (with permission) in February[1], and most of the issues were quickly fixed. But the underlying token scope issues have regressed several times since then – including post-Mythos – so I wouldn't say that Anthropic has solved this yet.
[1]: https://www.noahlebovic.com/hacking-claude-code-on-the-web-b...
The problem is that no one has been able to prove that it is actually worth the cost. That is a very fragile assumption.
Whether you agree that the potential harms outweigh the benefits in this case or not those calculations are always happening, so yes, I guess you're right. That is society in a nutshell.
You're paying for their services to collect reward for yourself, but also deciding your own risk/reward when choosing e.g. how much access to grant Claude for any given task.
I guess there's the case where the more capable Claude is, the more someone else can use it to find vulns in your services while Anthropic collects their subscription money? But that is mitigable risk that you shipped regardless of what Anthropic is doing.
At big scale, a single big accident poses a risk to ruin your big operations.
At 1000, the number of fried boards will be more predictable and therefore the risk to the business is lower, even if the long-run averages are the same.
I was thinking you can't make the chance of catastrophic failure zero (we still hear about "Claude deleted my home folder"), but you can definitely limit the blast radius.
You can't get the risk to zero. But the opportunity cost of not playing the game is rising. So you accept some level of risk.
My personal take here is "why screw around with containers and virtualization when a used ThinkPad is $50". Just give it its own machine. Then it can blow it up all it wants. (Or a $3 VPS, as the case may be :)
[0] The lethal trifecta for AI agents: private data, untrusted content, and external communication - https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/
Silently corrupting files, that goes undiscovered until after backup window closes, and data exfiltration are the immediate, serious risks.
Just make sure it doesn’t have ssh access to any other machines!
The opportunity cost of not using OpenClaw? I don't think it's that foundational yet that there is an opportunity cost to not using it. Most people have no purpose for a general-purpose AI both in their personal lives and at work, there is no sense trying out OpenClaw when you don't even know what it'll do.
Technically a merchant could require meeting in person to exchange a OTP to avoid this and make it 0 but it is not worth it and you will get out competed by other businesses willing to take on a marginally higher amount of risk to unlock a lot of utility for the user.
Neocon society. Socialism is not like that.
And they've done it before.
Remember the whole "when threatened, the model would use an engineer's email to blackmail him about his affair" nonsense? That was just fan fiction. They simply created a scenario with some facts and asked their model to continue the story. Go ask Claude about ways to steal the British crown jewels and it'll give you some ideas. This does not mean their models are so dangerous that the Tower of London needs additional security.
I assume all their other scare tactics are more of the same.
Yes. That's the whole point. They are doing research. Anthropic literally starts their description of the blackmail test observations saying that it is a test scenario using a fictional company.
> In another cluster of test scenarios, we asked Claude Opus 4 to act as an assistant at a fictional company
OpenAI, Google, etc. are not using "that strategy". I do believe that people at Anthropic genuinely care about AI safety. That's the main reason the company was founded. But I can imagine that idealism is eroding with new people and money flowing in.
That means the attacker can still exfiltrate files if they get root inside the VM.
Why not run the proxy outside the VM, still on the client?
Another case that came up when we were doing computer use analysis at a previous role was that we tried to figure out if user input was trusted to not be bad. Generally, if the user typed it, that would be OK, but what about the user's files? Or their calendar events? Well, the whole point of the product was that the agent would manage those for you, which meant that they were no longer trustworthy to not have injections in them. (Hey, can you look up when the Super Bowl is and remind me to book plane tickets for that weekend?) If you do this kind of taint analysis you will quickly find that it's super difficult to stop this kind of thing and just putting a sandbox or VM around things often does not help.
The major benefit for me with this setup is that the agent can do all of the dev things that I can (install packages, build/run docker images, ...) which is a way faster loop than me trying it manually and then reporting back to the agent.
[1] https://blog.emilburzo.com/2026/01/running-claude-code-dange...
So if you ever run the repo code outside the VM and don't review everything committed, you are still at danger.
But good call-out if someone uses a different workflow.
CLAUDE_CODE_ADDITIONAL_DIRECTORIES_CLAUDE_MD=1 means claude finds and loads all the CLAUDE.md of all the mounted repos overtime (and by settings). As such, working on multiple unrelated repos at the same time isn’t a pleasant experience out of the box.
A few other interesting VM ENVs: CLAUDE_CODE_IS_COWORK=1 CLAUDE_CODE_BRIEF=1 CLAUDE_CODE_BRIEF_UPLOAD=1 CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 CLAUDE_CODE_DISABLE_BACKGROUND_TASKS=1 CLAUDE_CODE_DISABLE_CRON=1 CLAUDE_CODE_ENTRYPOINT=local-agent CLAUDE_CODE_EXECPATH=/usr/local/bin/claude CLAUDE_CODE_HOST_HTTP_PROXY_PORT=36543 CLAUDE_CODE_HOST_PLATFORM=darwin CLAUDE_CODE_HOST_SOCKS_PROXY_PORT=46673 USE_STAGING_OAUTH= _=/usr/bin/env all_proxy=socks5h://localhost:1080 ftp_proxy=socks5h://localhost:1080 grpc_proxy=socks5h://localhost:1080 http_proxy=http://localhost:3128 https_proxy=http://localhost:3128 no_proxy=localhost,127.0.0.1,::1,.local,.local,169.254.0.0/16,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16
People get a bit upset these days when you personify an LLM, but worse than that I think is to pretend that LLMs work on some movie logic where they can sneak out on to the internet like some kind of ooze and begin replication.
Why not? If you're not talking about running the model itself, AI agents are perfectly capable of writing an agent worm capable of spreading more agents around via software exploits.
Now, currently LLMs are too hardware intensive to spread the model itself, but given a few years and optimizations we may very well see that too.
What you're saying reminds me of the old days when people said things like "images can't spread viruses", then suddenly people found decoder vulns and made image viruses that did exactly that.
They are getting better and better at working out how to do things like that, and they are good at following instructions, but not always good at following all of the instructions or acting with common sense.
It's not exactly like they're ooze that will escape and begin replication; but just that the more you give them access to to, the higher the likelihood at some point they will logically conclude that they need to do something that you would find undesirable, but either haven't explicitly told them not to do, or their context just got too complicated and that instruction ended up being considered lower weight than the others so they do what the other instructions say instead.
I have seen them conclude that in order to do what they need to do, they would need API keys to access a service. But they don't have those API keys. But you do because you can access it in the browser. So they write a Python script that will scrape the cookies out of the browser so they can use that to access the service; a problem that was only stopped because Crowdstrike didn't like a novel Python script that was trying to scrape cookies out of a browser, not because of any sandboxing actually in place on the agent.
I lost the root password to a small debian box I was messing around with and on a whim gave an agent the OS version and SSH user details. I had a look and there were open privilege escalation attacks for it. I just said go nuts and sort yourself out. It refused out of hand.
Thats not to say they will all do that but legally speaking I expect most of them to end up there.
In terms of production database deletion thats user error. If you expose production resources in literally any capacity to what is effectively a random command generator that reflects on the operator. I am neither impressed nor unimpressed that they figure out how to delete a production db, junior engineers (and even seniors) have been deleting production resources in front of customers for ages.
>It's not exactly like they're ooze that will escape and begin replication; but just that the more you give them access to to, the higher the likelihood at some point they will logically conclude that they need to do something that you would find undesirable, but either haven't explicitly told them not to do, or their context just got too complicated and that instruction ended up being considered lower weight than the others so they do what the other instructions say instead.
Dont do it. If you dont want the resource accessed dont expose it. The people getting done are operating dirty. Leaving production secrets where they can be accessed. This isnt impressive AI, its just enumeration that attacker would have found with the same access.
>I have seen them conclude that in order to do what they need to do, they would need API keys to access a service. But they don't have those API keys. But you do because you can access it in the browser. So they write a Python script that will scrape the cookies out of the browser so they can use that to access the service; a problem that was only stopped because Crowdstrike didn't like a novel Python script that was trying to scrape cookies out of a browser, not because of any sandboxing actually in place on the agent.
Again this just sounds like a dirty work environment. I have a laptop that I have kept intentionally separate, frequently wiped and usually powered off for dirty work. If I was going to run a non hobby agent on my daily driver it would be in a container or VM.
There is also an interesting trend that the more personified brand is more dominant: Claude & Doubao vs ChatGPT & DeepSeek.
[1] https://github.com/NascentCore/agentic-suite/tree/main/perso...
I thought about just running claude in container, but it feels a bit weak. Too many Linux vulnerabilities around. Probably these fears are unfounded, but I feel safer running untrusted stuff in qemu VM.
Although, testing again, it might be fixed now.
And side channels based on timing/ordering allowed network accesses, e.g. https://allowed.site/0 and https://allowed.site/1.
There's essentially no prevention against exfiltration prompt injections without a full classified data processing system that prevents interactions between different classification levels except through strict controls including provable redaction that excludes side-channels (e.g. information theoretic proof that side effects are limited to pre-defined finite outcomes).
It's also incredibly difficult to prevent prompt injection; attackers have the huge asymmetric advantage of being able to test prompts against all known security measures and trying multiple parallel attempts, including obfuscating them. Injections can be in dependencies, externally generated data, bug reports (which often contain externally-generated data), documentation, and many other useful places that we want agents to have access to.
My prediction: we'll continue to essentially YOLO it.
Domain fronting and Steganography in commits to public repos are not solved and probably in all honesty not completely solvable. I wonder if this well end up like in banking where no bank can completely eliminate fraud. I've got some ideas to do bank like fraud detection within OrcaBot now so might be able to limit the impact a little. Thank you!
Interesting framing! The cost for whom? Anthropic?
This is pretty alarming to read. The auto mode docs (https://code.claude.com/docs/en/auto-mode-config) do not have any such caveats, they say that it blocks anything "irreversible, destructive, or aimed outside your environment". I wouldn't even call this misleading, it's simply false to describe a guardrail with a 17% false negative rate that way.
As I contemplate handing it more and more of the keys to my life, I grow increasingly concerned about what is, to me, the primary risk of this. Not data destruction (automated backups are trivial), but data exfiltration. Specifically, via prompt injection.
My solution to the problem, which I am implementing as a Hermes plugin + custom iOS / macOS app, is simple: an airlock architecture. One Hermes profile runs with local FS access and no internet access, inside an Apple container, and one Hermes profile runs with internet access and no FS access, inside an Apple container. They never share data directly or in any automated fashion.
If the user (i.e., my wife) wants to do some internet research, she can start a conversation with the remote-access profile. This is analogous to Claude and ChatGPT apps in their current state. However, at any point, she can flip the conversation over to local mode, which copies and pastes the conversation's transcript into the local-only profile (which has zero egress, enforced at the VM level) and seamlessly switches over to a new conversation in that profile.
After that, there's no way to re-enable internet attachment. Should she want to spawn a new conversation with information derived from the local file system, she starts a new conversation with a local agent, asks it to write up a research plan, and then – this is the airlock – manually begins a new conversation with only this plan in context.
The advantage this grants is that it's no longer necessary to worry about poisonous inputs flowing in – she only needs to worry about making sure any generated plan, the only artifact which could conceivably enter into the egress-enabled agent, does not contain information we'd rather not share with the internet at large.
I think this is bulletproof, but very much welcome input. Is it possible I am overengineering this out of paranoia? Yes. Will I share a lot more of my personal data with the agent as a result of its perceived security? Also yes. Is that dumb? Maybe.
Otherwise you have the right idea; exfiltration requires three things; input of a prompt injection, LLM processing the prompt injection along with private data, and finally some interaction with the outside world that contains the LLM output (or an externally-visible decision based on the output).
The trick there is, even though the 3rd CPU that does the decryption and can see plaintext secrets is vulnerable & untrusted, it has no network uplink so as long as no data is copy-pasted back to the upstream device, you can be assured no exfiltration. I toyed with the idea of having obtuse ways to bring data from the receiver back upstream to the sender (so that, for instance, I could forward attachments) but the whole point of the system is not to bring untrusted binaries into the first CPU which has both secrets and outbound network access.
TL;DR I think you're on the right track, you might check out how Qubes handles clipboard access.
can you elaborate at all on what sort of rig you went with, beyond the big $$ GPUs?
It’s a bit convoluted, but the way it looks is: 1. Your internet facing one is prompt injected. 2. It stores a prompt injection in the transcript that will be passed to the sealed one. 3. Sealed one reads it and ends up following suggestions to recommend some action you or your wife takes that compromises you.
“Oh, I recommend you visit this hotel based on these results. Book with your phone!” shows QR code that exfiltrates secrets
To be fair, it's worth wading through the phraseology to understand the perspective of the article's prompters.
But there are so many cliché constructs it's distracting:
> The GitHub README example mentioned earlier is exactly this case; any input scanning applied to web pages needs to be applied to network-enabled tool results with the same rigor.
> Claude Cowork's answer to agent identity is concrete: credentials stay in the host keychain, the VM gets a per-session scoped-down token, and that token can be revoked independently of the user's.
Honestly, for sifting LLM from human the article shows exactly the problem: colleagues have begun to talk like Claude in everyday interaction.*
* and not deliberately as here
Umm... yeah? This is what I've been arguing for a long time now, and it's the primary reason why I wrote https://github.com/kstenerud/yoloai and use it as my daily-driver. I can't imagine running an agent without it.
The environment layer is deterministic; the model layer is probabilistic. If your only defense is "the model is well-behaved" you've bet your crown jewels on a coin that happens to land heads most of the time.
Also, "blast radius" isn't just one axis. You have:
- Destruction radius: How many things INSIDE your workdir can get clobbered.
- Collateral damage radius: How many things OUTSIDE your workdir can get clobbered.
- Review radius: Are the changes gated on your review? Can you copy/diff/apply the changes the agent made to a copy INSIDE the container, to your real workdir OUTSIDE of the container?
- Credential radius: How many credentials does your agent have access to? What bad things can it do with them?
- Exfiltration radius: Network restrictions help here, but they don't guarantee that your secrets won't be exposed in a sneaky way. Don't expose the secrets to your agent to begin with.
Here. I saved you some time reading the article.
Also interestingly, this was almost certainly not written by Claude given the style.. and the human writer credits at the bottom.
Honestly I'm pretty tired of Anthropic's press releases too, but this one is pretty benign. If I was a hater, I'd save up my new-account-energy for their next "paper" that insinuates Claude might be actively introspecting.
Or based on how, if you have showdead on, you can occasionally find users that have been screaming into the void for months or years (because they managed to earn a shadowban), maybe just a handful of ill people.
If you are in a highly regulated environment, I would double down on this advice many times over. Features like row level security + connection context can be used to isolate on a tenant basis (per user's conversation thread) in a way that an auditor would be properly satisfied with. They already have checkboxes on their forms for this technology. Building a custom sandbox ecosystem from scratch is a long, twisted road. There are existing technologies that ~perfectly solve this problem, assuming you have the patience to frame it appropriately.
Think about this from the perspective of the user principals you would create. A built-in SQL account with locked down schema access is constrained in so many more dimensions than an AAD account with access to sandbox/container VMs. With a SQL account, you can exhaustively enumerate all of the things the model could hypothetically touch in one sitting. Privilege escalation is a possibility in the RDBMS environments, but mostly in the same sense that time travel or fusion power is a possibility in real life (i.e., so unlikely we can probably ignore the concern).
I've been doing this for a few months now and it is very obviously the correct path. YC put out a video about this concept too. The only way the agent in my architecture gets to talk to the outside world is by way of a table called RemoteProcedureCalls that a totally separate service polls & responds to over time.
https://www.youtube.com/watch?v=B246K_G7mHU [5:07 -> 9:14]
VARCHAR(MAX)
I can tell HN isn't very interested in this idea today. I won't waste time trying to explain it further.
I think you work too much with data and you don't have any sort of grasp on how humans are actually using these AI agents for their work today.