upvote
vLLM is great at continuous batching and model serving in production, but it's a very different beast and much less versatile for the prosumer category (where we sit for our usage)

Dismissed is a strong term, but let me give you some more details.

It took a good 4 minutes plus to load up on the 2x 3090 rig, and served a single request 3 tokens/second slower.

And the worst bit? With all that work - setting it up and tuning it - it still looped. I was hoping "use just vLLM" advice that we get touted everywhere was the silver bullet.

The only thing I'd caution here is that we don't start bashing on llama.cpp like people did with Ollama. It's a very capable tool and for the use-cases we actually want the card for makes more sense.

For a large team replacing their Claude Subs perhaps vLLM is the only option, but you really need to add about 5 more RTX 6000 cards into the mix, so you can load something like GLM 5.2.

reply
Bashing on ollama is totally warranted, since ollama is a UI skin around llama.cpp and that's it. If all you cared about was "I want to run a model and use it via an API" then the only thing it did was give you a GUI to download models (vs browsing HuggingFace yourself and downloading .gguf files yourself) and a GUI with a button labeled "run" (instead of a run.sh or run.bat script launching llama-server).

That's not _nothing_, but it's pretty close to nothing, and for the prosumer crowd it edges towards "just gets in the way".

reply
One could say: vLLM isn't a worse Llama.cpp, it's a different tool
reply
AFAIR the general consensus is (was?): - llama.cpp for single user - vLLM for multi-user (e.g. enterprises)

They are similar, but for different use cases.

reply
Yeah, I was a bit baffled by the author complaining about cache prefixes getting destroyed when more than one user hit the model, but then continuing to use llama.cpp instead of switching to vLLM.
reply