> The warning I would have for everyone is to temper your expectations and read the fine print carefully. The big build in article starts off with a $40K budget and then includes 4 GPUs that are $12K each. For those doing the math, this build is going to cost more like 50-55K.

> Local setups also often rely on quantization and techniques like REAP to fit the models on their hardware.

This seems to ignore the very real possibility of running SOTA models at full precision on ordinary local hardware using SSD offload. Yes, this will be slow and will usually have very low throughput (even batched decode can only achieve so much before power and thermal limits become important, and that still leaves you with slow prefill as a major bottleneck), but that's OK if you aren't expecting a real-time response to begin with and your volumes as a single user are low enough.

by Aurornis1 hours ago

SSD streaming throughput is too slow to be usable.

GLM-5.2 has 40B active parameters at a time. At Q4 that's 20GB. The best PCIe 5 SSDs can get 15GB/sec when everything goes well. Every expert load would take more than a second.

If you had enough RAM and enough SSDs in parallel you might get a couple tokens per second on a good day. If you left this machine running 24 hours straight, you might be able to get 200,000 tokens generated.
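The arithmetic here can be checked with a quick sketch. It uses only the figures quoted in this comment (40B active parameters, Q4, 15GB/sec) and assumes the worst case, where every token streams the full set of active weights from a single SSD with nothing cached in RAM. The function names are made up for illustration.

```python
# Back-of-envelope check of the SSD streaming figures quoted in the comment.
# All inputs are the comment's own numbers; none of this is measured.

ACTIVE_PARAMS = 40e9       # active parameters per token
BITS_PER_WEIGHT = 4        # Q4
SSD_BYTES_PER_SEC = 15e9   # best-case PCIe 5 SSD read rate
SECONDS_PER_DAY = 24 * 60 * 60


def active_bytes(params: float, bits: float) -> float:
    """Bytes of weights touched per token at the given bit width."""
    return params * bits / 8


def seconds_per_token(params: float, bits: float, bytes_per_sec: float) -> float:
    """Time to stream the active weights once, assuming nothing is cached in RAM."""
    return active_bytes(params, bits) / bytes_per_sec


def tokens_per_day(tokens_per_sec: float) -> float:
    """Tokens generated in 24 hours at a constant decode rate."""
    return tokens_per_sec * SECONDS_PER_DAY


if __name__ == "__main__":
    gb = active_bytes(ACTIVE_PARAMS, BITS_PER_WEIGHT) / 1e9
    secs = seconds_per_token(ACTIVE_PARAMS, BITS_PER_WEIGHT, SSD_BYTES_PER_SEC)
    print(f"active weights: {gb:.0f} GB")
    print(f"one SSD, no caching: {secs:.2f} s per token")
    print(f"at 2 tokens/s: {tokens_per_day(2):,.0f} tokens per day")
```

Reading "a couple tokens per second" as 2 tokens/s gives roughly 170,000 tokens per day, which is in the same ballpark as the 200,000 figure.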

So it can be done, but only if you interact with your LLM like you're e-mailing someone back and forth and you're okay waiting until tomorrow for a response.

You would spend $50K to buy a machine that consumes 2000W and takes all day to produce as many tokens as I could buy on OpenRouter for $0.60. You would spend $5-15 on electricity depending on where you live.
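A rough sketch of where these numbers come from. The 2000W draw, the 200,000 tokens per day, and the $0.60 API cost are taken from this comment; the electricity rates of $0.10 and $0.30 per kWh are assumptions chosen to bracket "depending on where you live".

```python
# Daily running cost versus the API-equivalent price, using the comment's figures.
# The per-kWh rates below are illustrative assumptions, not quoted prices.

POWER_WATTS = 2000
HOURS_PER_DAY = 24
TOKENS_PER_DAY = 200_000
API_COST_USD = 0.60


def daily_kwh(watts: float, hours: float) -> float:
    """Energy used in a day at constant power draw."""
    return watts * hours / 1000


def electricity_cost(kwh: float, usd_per_kwh: float) -> float:
    """Cost of that energy at a flat rate."""
    return kwh * usd_per_kwh


def usd_per_million_tokens(cost_usd: float, tokens: float) -> float:
    """Implied price per million tokens."""
    return cost_usd / tokens * 1_000_000


if __name__ == "__main__":
    kwh = daily_kwh(POWER_WATTS, HOURS_PER_DAY)
    print(f"energy per day: {kwh:.0f} kWh")
    for rate in (0.10, 0.30):  # assumed cheap and expensive residential rates
        print(f"at ${rate:.2f}/kWh: ${electricity_cost(kwh, rate):.2f} per day")
    api = usd_per_million_tokens(API_COST_USD, TOKENS_PER_DAY)
    print(f"implied API price: ${api:.2f} per million tokens")
```

Under these assumptions the electricity alone costs several times what the same tokens would cost through an API, before counting the hardware.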

If you have no other option but to process data locally and you must use a very large model and you aren't in a rush, this can do it. I would not recommend it unless you're desperate and operating inside of rigid constraints.

by CuriouslyC57 minutes ago

You can improve that with speculative preload. I'm sure models could be designed and tuned around efficient SSD offloading to keep throughput pretty high.

by searealist8 minutes ago

It would apply equally to GPU or RAM inference as those are also bandwidth constrained on decode, so people already try to optimize for it.

by rsalus24 minutes ago

surely the supply of unified memory will rise to meet demand before this is needed

by odo12421 hours ago

This is similar to my experience with (8-bit quantized, non-MoE, 26B) Qwen locally on my computer. It’s really good for small tasks, but the first time I tried to do a major task with it, it straight up forgot what agent harness it was in and started using the wrong format for tool calls lol

(If you’re curious, it was running in Pi, but somehow convinced itself it was running in Claude instead and started trying to call Claude tools that didn’t exist)

by FuckButtons1 hours ago

I’ve found ds4 on my MBP to be very useful, bought before RAM prices became insane. It’s not writing entire applications on its own, but it has resolved annoying networking issues on my tailnet that I had neither the time nor inclination to figure out on my own, and I often find myself reaching for it for simple but annoyingly research-intensive tasks that I wouldn’t have otherwise gotten to. Is it Opus? No, but is it useful? Absolutely, and I don’t have to worry about whether or not I’m getting value out of a subscription or the API cost of using it.

by vient38 minutes ago

Wonder if the AMD MI350P release will affect setups like this. From what I've heard, the price will be pretty similar to the RTX PRO 6000 while having 50% more VRAM, which is additionally HBM3E instead of GDDR7.

by bloat55 minutes ago

They do say the cards were purchased when they were cheaper. They debuted at less than nine grand apparently.

by ttoinou2 hours ago

Well, you could make a REAP with better input prompts on longer context then. It’ll improve the REAP quality.

by CamperBob23 hours ago

All very true. Right now, running GLM 5.2 at full BF16 precision needs 1.5 TB of VRAM. You can't run this locally at a usable speed for less than $250K or so, and frankly I'd be surprised if it could be done for less than $500K.

The best NVFP4 quant for 5.2 appears to be lukealonso's at https://huggingface.co/lukealonso/GLM-5.2-NVFP4, and it is capable of good throughput (75-100 tps) without losing much reasoning performance. Allowing for overhead for the KV cache and other requirements, this quant will (barely) run in 8-way tensor-parallel mode on 8x RTX 6000 cards. Not too long ago it was possible to put an 8x machine together for less than $100K USD, but that's probably not true now, assuming you buy all-new components.

It'll almost certainly be worth it, given the abusive behavior we've seen and will continue to see from the major closed-model providers. If I hadn't already put a similar rig together, I'd be kicking myself. But getting it running well is by no means as simple as buying a bunch of RTX6K cards and calling it a day, and people need to know what they're getting into.

Local AI is in its Altair and IMSAI days. There's no turnkey Apple II or C64 on the market yet, much less an IBM PC. Hardware, yes -- you can buy a capable box off the shelf from various vendors -- but you have to be prepared to take up a whole new hobby when it comes to getting a complete system working well.

by Aurornis3 hours ago

> It'll almost certainly be worth it, given the abusive behavior we've seen and will continue to see from the major closed-model providers.

The proper financial comparison for GLM-5.2 would be one of the providers on OpenRouter or renting a server as needed. Compare apples to apples.

You will almost certainly never break even compared to paying per token.

Local LLMs at this scale are only worth it if you have extremely strict requirements that data not leave the premises.

by jobeirne3 hours ago

Or if you want to hedge against the various tail risks of third-party providers raising prices or denying you service or somehow abusing your data...

by Aurornis3 hours ago

> hedge against the various tail risks of third-party providers raising prices

They could 10X the prices and you’d still be better off. It’s also unlikely that prices go up enough to warrant a $100K local investment to prevent paying a couple bucks per million tokens.

> or denying you service

I guess you’re not familiar with OpenRouter? There are many providers there. There are providers outside of OpenRouter. There will always be someone to take your business.

> or somehow abusing your data...

If data security is your concern, then you’re still better off renting a server as needed.

If you cannot tolerate any data leaving, then local models are the only way. You pay a high premium for it!
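The break-even argument can be sketched with numbers from this thread. The $100K hardware cost and the 100 tps upper end of the quoted throughput come from the comments above; reading "a couple bucks per million tokens" as $2 is an assumption, as is running a single stream at 100% utilization with no electricity or maintenance costs. Batched serving would change the throughput side considerably, so treat this as an illustration rather than a verdict.

```python
# Rough break-even estimate: days of continuous generation before the tokens
# produced locally would have cost as much as the hardware if bought per token.
# Assumed: $2 per million tokens, one stream, no power or upkeep costs.

SECONDS_PER_DAY = 24 * 60 * 60


def tokens_per_day(tokens_per_sec: float, utilization: float = 1.0) -> float:
    """Tokens generated per day at a given decode rate and duty cycle."""
    return tokens_per_sec * SECONDS_PER_DAY * utilization


def breakeven_days(
    hardware_usd: float,
    tokens_per_sec: float,
    usd_per_million_tokens: float,
    utilization: float = 1.0,
) -> float:
    """Days until the API-equivalent value of generated tokens equals hardware cost."""
    daily_tokens = tokens_per_day(tokens_per_sec, utilization)
    daily_value_usd = daily_tokens / 1_000_000 * usd_per_million_tokens
    return hardware_usd / daily_value_usd


if __name__ == "__main__":
    for price in (2.0, 20.0):  # assumed current price, then the "10X" scenario
        days = breakeven_days(100_000, 100, price)
        print(f"${price:.0f}/M tokens: {days:,.0f} days ({days / 365:.1f} years)")
```

With these inputs the hardware takes on the order of 16 years to pay for itself at $2 per million tokens, and still more than a year and a half at ten times that price, assuming it never sits idle.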

by incrudible2 hours ago

Raising prices is not a tail risk. Anything a local LLM setup can do for you can be done by any cloud provider with the same capex as yours (or less). There is no moat here, so it is highly price competitive and will remain so. If you want to speculate on hardware shortages, that is a different business altogether, and you need no janky garage setup to profit.

by CamperBob23 hours ago

Also agreed, it's definitely a sucker's game to run a high-end model locally, by any objective measure.

Still... if it's not your weights, running on your box, you're always going to be behind somebody else's 8-ball. Everybody has to decide for themselves where their priorities lie.