GLM-5.2 performing like it would from a good provider: 8x B200s, so $450k. (No personal experience here.)

GLM-5.2, severely quantised: a 512GB Mac Studio, somewhere between $10k and $35k for a used M3. Or run it on a CPU with 768GB of RAM by getting an old PowerEdge with DDR4 for around $5,000.

Qwen-3.6-35b-q6 runs well on an RTX 5090 ($4000 + cost of a PC) and runs mediocrely on an Intel Arc B70 ($1000 + cost of a PC, plus lots of fiddling to get the setup to work right).

Gemma is a good candidate for the cheaper stuff, but I lack personal experience with using it locally.
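For anyone wanting to sanity-check pairings like the Qwen one above, here is a rough back-of-the-envelope sketch. It assumes weight size is simply parameters times nominal bits per weight, and that a flat 4 GB (a made-up allowance, not a measured figure) covers KV cache and runtime overhead. Real quantised files use slightly more bits per weight than the nominal number, so treat the result as a lower bound. The function names are hypothetical.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_in_memory(params_billions: float, bits_per_weight: float,
                   memory_gb: float, overhead_gb: float = 4.0) -> bool:
    """True if weights plus an assumed flat overhead fit in memory_gb.

    overhead_gb is a placeholder allowance for KV cache and runtime
    buffers; the right value depends on context length and the runtime.
    """
    needed = weight_memory_gb(params_billions, bits_per_weight) + overhead_gb
    return needed <= memory_gb


if __name__ == "__main__":
    # A 35B model at a nominal 6 bits per weight.
    print(weight_memory_gb(35, 6))
    # 32 GB card (e.g. RTX 5090) versus a 24 GB card.
    print(fits_in_memory(35, 6, memory_gb=32))
    print(fits_in_memory(35, 6, memory_gb=24))
```

Under these assumptions a 35B model at 6 bits fits on a 32 GB card but not on a 24 GB one without a lower quant or offloading.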

by jack_pp, 12 hours ago

This framing of local LLMs as free is stupid. Paying 100+ months' worth of API costs up front isn't free in the slightest. And it will be slower than non-local, your hardware will be outdated in 12 months, and it probably won't be able to run SOTA at anywhere near non-local speed within 20 months at most.

by ulrikrasmussen, 12 hours ago

Yeah, it glosses over a gigantic capital expenditure. It's sort of like saying that an open source modern CPU architecture allows you to build your own CPU "for free" (provided that you own and operate a fab).

by cicko, 12 hours ago

True. But there are other meanings of "free". I.e. nobody can say "from now on you no longer have access to model X because you're an asshole"

by trollbridge, 10 hours ago

Some obvious reasons you'd want to spend the capital on this: you're building some kind of autonomous system that needs to be offline periodically, or you need complete confidentiality about what you're using the model for, etc.

To be cost-competitive with inference providers, you have to find some way to be using it 24/7.
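A minimal sketch of the break-even arithmetic behind that claim. All dollar figures are hypothetical placeholders, not real prices, and the model ignores resale value, depreciation, and the speed difference between local and hosted inference.

```python
def break_even_months(hardware_cost: float, monthly_api_cost: float,
                      monthly_power_cost: float = 0.0) -> float:
    """Months until up-front hardware spend equals cumulative API spend.

    monthly_api_cost is what the same workload would cost at a provider;
    monthly_power_cost is the running cost of the local machine.
    Returns infinity if local running costs eat the whole saving.
    """
    monthly_saving = monthly_api_cost - monthly_power_cost
    if monthly_saving <= 0:
        return float("inf")
    return hardware_cost / monthly_saving


if __name__ == "__main__":
    # Light usage: a $5,000 box versus $50/month of API calls.
    print(break_even_months(5000, 50))
    # Heavy, near-24/7 usage: the same box versus $500/month of API calls.
    print(break_even_months(5000, 500))
    # Power costs more than the API bill: never breaks even.
    print(break_even_months(5000, 50, monthly_power_cost=60))
```

The light-usage case is where the "100+ months" figure upthread comes from; the payback period only becomes reasonable once utilisation is high.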

by Der_Einzige, 6 hours ago

The ecosystem for inference is centralized around a few core projects, i.e. vLLM, SGLang, and llama.cpp.

If they decided to collude, they could absolutely say "from now on you no longer have access to model X because you're an asshole"

The commercial inference offerings are also downstream of one of those three projects (or TensorRT-LLM if they're NVIDIA). It would impact Ollama, Fireworks, Together, and everyone else.

Don't tempt fate.

by throwaway219450, 2 hours ago

"Hardware outdated in 12 months" is FUD. In practice that would require either affordable consumer GPUs with more than 32GB of VRAM, which doesn't look like it's going to happen, or unified memory systems with much higher bandwidth, which also seems unlikely.

You're better off setting a budget and buying the best machine you can afford in that range, or picking a VRAM target and accepting the class of models you can run on it. Those models will almost certainly improve over time and your skills will adapt to the limitations. Hardware is so valuable right now that it's not even likely to be a significant loss if you had to sell.

Right now I think 24GB is probably the best bang for your buck (used 3090), because you also get a high-end gaming/GPGPU device, which is nice anyway. You can get 32GB with AMD or Intel, but NVIDIA is megabucks, and at this point you're really paying for RAM. Unfortunately the ship has sailed on "reasonably" priced RTX 6000s, which at one point were about $7k and are now being listed at $10k++.

by bestouff, 12 hours ago

The price of a small house.

by crimsoneer, 12 hours ago

Practically nobody.