upvote
from my understanding, you can run the inference server (llama.cpp/vllm/whatever) and the agent/harness in different contexts, event different machines.

The risky part is in the agent/harness and what tools it has access to.

You don't need to give GPU passthrough to the VM running the agent/harness.

There is still a risk of a prompt messing with the inference server, but I think that's a much lower risk compared to an agent doing whatever on its own.

reply
Right. All my experiments are naïve, I am sure, but I run the LLM on the host and expose it via OpenAI API to the VMs.

This approach requires that you trust the llama.cpp codebase, essentially. It might be reasonable not to.

I suppose in principle there is the risk of a prompt exploit corrupting the inference server.

reply