undefined

points

[-]

Right. All my experiments are naïve, I am sure, but I run the LLM on the host and expose it via OpenAI API to the VMs.

This approach requires that you trust the llama.cpp codebase, essentially. It might be reasonable not to.

I suppose in principle there is the risk of a prompt exploit corrupting the inference server.