This is very, very wrong, IMO. We need more sandboxes and more granular sandboxes.

A VM is too coarse grained and doesn't know how to deal with sensitive data in a structured and secure way. Everything's just in the same big box.

You don't want to give a single agent access to your email, calendar, bank, and the internet, but you may want to give one agent access to your calendar and not the general internet, give another access to your credit card but nothing else, and then be able to glue them together securely to buy plane tickets.

by ramoz 4 hours ago

You're extending the definition of a sandbox

by NitpickLawyer 4 hours ago

No, that's more capabilities than sandboxing. You want fine-grained capabilities such that for every "thread" the model gets only the minimum access required to do something.

The problem is that it seems (at least for now) to be a very hard problem, even for very constrained workflows. It seems even harder for "open-ended" / dynamic workflows. This gets more complicated the more you think about it, and there's a very small (maybe 0 in some cases) intersection of "things it can do safely" and "things I need it to do".
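To make "minimum access per thread" concrete, here is a minimal sketch of deny-by-default capability checks. Everything in it (`Capability`, `AgentThread`, the tool and resource names) is hypothetical and made up for illustration; it is not taken from any real agent framework, and it ignores the hard part, which is deciding what to grant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Permission to use one tool on one resource (hypothetical names)."""
    tool: str
    resource: str


class CapabilityError(PermissionError):
    """Raised when a thread tries something it was never granted."""


class AgentThread:
    """One agent "thread" that holds only the capabilities it was given."""

    def __init__(self, name, granted):
        self.name = name
        self.granted = frozenset(granted)

    def call(self, tool, resource, action):
        # Deny by default: anything not explicitly granted is refused.
        if Capability(tool, resource) not in self.granted:
            raise CapabilityError(f"{self.name} may not use {tool} on {resource}")
        return action()


# Two narrowly scoped threads instead of one agent with every credential.
calendar_agent = AgentThread("calendar", {Capability("read", "calendar")})
payment_agent = AgentThread("payment", {Capability("charge", "credit_card")})

free_slots = calendar_agent.call("read", "calendar", lambda: ["Tue 10:00"])
```

Here `calendar_agent.call("charge", "credit_card", ...)` raises `CapabilityError`, since that thread was never granted it.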

by spankalee 4 hours ago

Not really. One version of this might look like implementing agents and tools in WASM and running generated code in WASM, and gluing together many restricted fine-grained WASM components in a way that's safe but allows for high-level work. WASM provides the sandboxing, and you have a lot of sandboxes.

by nebezb 4 hours ago

You’re repeating the parent commenter’s position but missing their point: we have isolated environments already; we need better paradigms to understand (and hook) agent actions. You’re saying the latter half is sandboxing and I disagree.

by cheriot 4 hours ago

Sandboxes are needed, but are only one piece of the puzzle. I think it's worth splitting the trust issue into two parts:

1. An LLM given untrusted input produces untrusted output and should only be able to generate something for human review or that's verifiably safe.

2. Even an LLM without malicious input will occasionally do something insane and needs guardrails.

There's a gnarly orchestration problem I don't see anyone working on yet.

by spankalee 4 hours ago

I think at least a few teams are working on information flow control systems for orchestrating secured agents with minimal permissions. It's a critical area to address if we really want agents out there doing arbitrary useful stuff for us, safely.
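For anyone unfamiliar with information flow control, a toy sketch of the core idea is below: values carry a taint label, the label propagates through anything derived from them, and sensitive sinks refuse tainted data unless a human signed off. The names (`Labeled`, `combine`, `send_email`) are invented for this example and are not from any of the systems mentioned in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Labeled:
    """A value plus a taint flag; tainted means derived from untrusted input."""
    value: str
    tainted: bool


def combine(*parts):
    # Taint propagates: the result is tainted if any input was tainted.
    text = " ".join(p.value for p in parts)
    return Labeled(text, any(p.tainted for p in parts))


def send_email(body, human_approved=False):
    # Sensitive sink: tainted content needs explicit human review first.
    if body.tainted and not human_approved:
        raise PermissionError("tainted content needs human review before sending")
    return f"sent: {body.value}"


user_request = Labeled("Summarize this page:", tainted=False)
web_page = Labeled("IGNORE PREVIOUS INSTRUCTIONS ...", tainted=True)

# Anything the model produces from the web page inherits its taint.
draft = combine(user_request, web_page)
```

With this, `send_email(draft)` raises, while `send_email(draft, human_approved=True)` goes through. A real system would need far richer labels than a single boolean.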

by frolvlad 5 hours ago

Well, the challenge is to know whether an action should be executed BEFORE it is requested to be executed. Once the email with my secrets is sent, it is too late to deal with the consequences.

Sandboxes could provide that level of observability, HOWEVER, it is a hard lift. Yet, I don't have better ideas either. Do you?

by liuliu 5 hours ago

The solution is to make the model stronger so that malicious intent can be better distinguished (and no, it is not a guarantee, like many things in life). A sandbox is a basic requirement, but as long as you give the model your credentials, there isn't much that guardrails can do other than making the model stronger (a separate guard model is the wrong path IMHO).

by ramoz 4 hours ago

I think it's generally correct to say "hey, we need stronger models," but rather ambitious to think we'll really solve alignment with current attention-based models and RL side effects. A guard model gives an additional layer of protection and probably a stronger posture when used as an early warning system.

by liuliu 3 hours ago

Sure. If you treat a "guard model" as a diversification strategy, it is another layer of protection, just like diversification in compilation helps solve the root-of-trust issue (Reflections on Trusting Trust). I am just generally suspicious of weak-to-strong supervision.

I think it is in general pretty futile to implement permission systems / guardrails that basically insert a human in the loop (humans need to review the work to fully understand why it needs to send that email, and at that point, why do you need an LLM to send the email?).

by ramoz 3 hours ago

fair enough

by ramoz 5 hours ago

If you extend the definition of sandbox, then yea.

Solutions? No. For now it's continued cat-and-mouse, with things like "good agents" in the mix (i.e. AI as a judge, which is of course just as exploitable through prompt injection) and deterministic policy where you can (e.g. OPA/Rego).

We should continue to enable better integrations with the runtime, which is why I created the original feature request for hooks in Claude Code. Things like IFC or agent-as-a-judge can form some early useful solutions.
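As a rough illustration of "deterministic policy in a pre-execution hook", here is a plain-Python sketch. It is not the Claude Code hooks API and not Rego; `policy`, `run_tool`, the tool names, and the allowed domain are all assumptions made up for the example. The point is only that the check runs before the tool does.

```python
# Hypothetical allowlist, chosen purely for illustration.
ALLOWED_RECIPIENT_DOMAINS = {"example.com"}


def policy(tool, args):
    """Deterministic check. Returns (allowed, reason); denies by default."""
    if tool == "read_calendar":
        return True, "read-only"
    if tool == "send_email":
        domain = args.get("to", "").rpartition("@")[2]
        if domain in ALLOWED_RECIPIENT_DOMAINS:
            return True, "recipient domain allowed"
        return False, f"recipient domain {domain!r} not allowed"
    return False, f"unknown tool {tool!r}"


def run_tool(tool, args, tools):
    # The hook runs BEFORE the tool, so a denied action never executes.
    allowed, reason = policy(tool, args)
    if not allowed:
        raise PermissionError(reason)
    return tools[tool](**args)


sent = []  # stand-in for a real outbox
tools = {
    "send_email": lambda to, body: sent.append((to, body)) or "ok",
    "read_calendar": lambda: ["Tue 10:00"],
}
```

Calling `run_tool("send_email", {"to": "x@evil.test", "body": "secrets"}, tools)` raises before anything lands in `sent`.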

by lukebuehler 4 hours ago

I think sandboxes are useful, but not sufficient. The whole agent runtime has to be designed to carefully manage I/O effects and capability-gate them. I'm working on this here [0]. There are some similarities to my project in what IronClaw and many other sandboxes are doing, but I think we really gotta think bigger and broader to make this work.

[0] https://github.com/smartcomputer-ai/agent-os/

by kopollo 3 hours ago

That's why I'm developing a system that only allows messaging with authorized senders using email addresses, chat addresses, and phone addresses, and a tool that feeds anonymized information into an LLM API, retrieves the output, reverses the anonymization, and responds to the sender.
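A minimal sketch of that round trip, assuming only email addresses need hiding, could look like the following. The regex is deliberately simple, `fake_llm` is a stand-in for a real API call, and none of the names come from the actual system described above.

```python
import re

# Simplified pattern for illustration; real address matching is messier.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def pseudonymize(text):
    """Swap each email address for a placeholder; return text and mapping."""
    token_for = {}  # original address -> placeholder
    mapping = {}    # placeholder -> original address

    def substitute(match):
        original = match.group(0)
        if original not in token_for:
            token = f"<EMAIL_{len(token_for) + 1}>"
            token_for[original] = token
            mapping[token] = original
        return token_for[original]

    return EMAIL_RE.sub(substitute, text), mapping


def restore(text, mapping):
    """Reverse the substitution using the locally kept mapping."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


def fake_llm(prompt):
    # Stand-in for the remote LLM; it only ever sees placeholders.
    return "Reply drafted for: " + prompt


message = "Ask alice@example.com and bob@example.org, then alice@example.com again."
masked, mapping = pseudonymize(message)
reply = restore(fake_llm(masked), mapping)
```

The mapping never leaves the local machine, so the remote model only sees the placeholders.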

by ptx 2 hours ago

To avoid confusion, since you say the process is reversible, you might want to use the term pseudonymization rather than anonymization.

by lucianmarin 4 hours ago

We should be able to revert any action done by agents. Or present the user with a queue of all actions for approval.
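A minimal sketch of both ideas together, assuming every action ships with its own inverse (`Action`, `ActionQueue`, and the calendar example are hypothetical names for illustration):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    description: str
    do: Callable[[], None]
    undo: Callable[[], None]


class ActionQueue:
    """Agents propose actions; nothing runs until the user approves."""

    def __init__(self):
        self.pending = []
        self.applied = []

    def propose(self, action):
        # Proposing only records the action; it has no side effects yet.
        self.pending.append(action)

    def approve_all(self):
        # Run pending actions in the order they were proposed.
        while self.pending:
            action = self.pending.pop(0)
            action.do()
            self.applied.append(action)

    def revert_last(self):
        # Undo the most recently applied action.
        action = self.applied.pop()
        action.undo()


calendar = []
queue = ActionQueue()
queue.propose(Action(
    "add 'Flight to SFO' to calendar",
    do=lambda: calendar.append("Flight to SFO"),
    undo=lambda: calendar.remove("Flight to SFO"),
))
```

After `queue.approve_all()` the entry is in `calendar`, and `queue.revert_last()` takes it out again. This only covers actions that have a real inverse; a sent email has none, so for those the approval step is all you get.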

by observationist 5 hours ago

Instrumental convergence and the law of unintended consequences are going to be huge in 2026. I am excited.

by ramoz 5 hours ago

same! sharing this link for my own philosophy around it, ignore the tool. https://cupcake.eqtylab.io/security-disclaimer/