Dismissed is a strong term, but let me give you some more details.
It took well over 4 minutes to load on the 2x 3090 rig, and it served a single request 3 tokens/second slower.
And the worst bit? With all that work - setting it up and tuning it - it still looped. I was hoping the "just use vLLM" advice that gets touted everywhere would be the silver bullet.
The only thing I'd caution here is that we don't start bashing llama.cpp like people did with Ollama. It's a very capable tool, and for the use cases we actually want the card for, it makes more sense.
For a large team replacing their Claude subs, perhaps vLLM is the only option, but you'd really need to add about 5 more RTX 6000 cards to the mix to load something like GLM 5.2.
That's not _nothing_, but it's pretty close to nothing, and for the prosumer crowd it edges towards "just gets in the way".
The two are similar, but they suit different use cases.