upvote
It depends - for what? If your security model is sandboxing an agent to ensure they don't nuke your PC, then there are a lot of options, you can use something like bubblewrap[1] or a microVM like libkrun[2] if your goal is light-weight, up to full Docker if you want the tooling that comes with that.

[1] https://github.com/containers/bubblewrap

[2] https://github.com/libkrun/libkrun

reply
Full fat VMs with GPU passthough I trust a lot less then CPU ones.
reply
from my understanding, you can run the inference server (llama.cpp/vllm/whatever) and the agent/harness in different contexts, event different machines.

The risky part is in the agent/harness and what tools it has access to.

You don't need to give GPU passthrough to the VM running the agent/harness.

There is still a risk of a prompt messing with the inference server, but I think that's a much lower risk compared to an agent doing whatever on its own.

reply
Right. All my experiments are naïve, I am sure, but I run the LLM on the host and expose it via OpenAI API to the VMs.

This approach requires that you trust the llama.cpp codebase, essentially. It might be reasonable not to.

I suppose in principle there is the risk of a prompt exploit corrupting the inference server.

reply