I'm really curious about this, not because I disagree, but because I want to avoid agents going whack. Are you running vLLM just for yourself, or for a team, or for an application, etc.? And do you feel there is a minimum hardware requirement for vLLM to be useful in this way?

My weekend project is going to be building a home inference server (from ancient datacenter parts) and I'm still massaging in my head what the end result will be.

If I were starting to build a server today, I'd jump right into verified setups and writeups, like this one:

https://github.com/noonghunna/club-3090

You can find info there about running a patched version of vLLM on 1x 24 GB, 2x, and 4x cards. There are also quite a few "Blackwell" subreddits, where people seem to share a lot of substantial information, if you're going the 6000 route.

That writeup is completely unhinged and utterly incomprehensible to follow.

It just throws "you can do <large number>" at you, with no real explanation of how it manages that or which trade-offs it makes. I still don't know for certain, but I think one of those trade-offs is 3-bit context? Which is a terrible idea.

Please don't share these walls of noise. They shouldn't exist.

Why unquantized instead of Q8?

I noticed a few cliffs. Sometimes it was a spurious stop (I had to write "go on" or "continue" to restart it), other times it would randomly say "Oh, the user wants [the thing we already resolved]" and go back in the history. It all cleared up on FP16.