And, the several thousand dollars it costs to run these things unusably slowly buys a lot of tokens on the cheap Chinese models.
Unless you really want privacy or the fuzzy feeling of owning your own, paying a hyperscaler is cheaper, more convenient and gets you much faster tok/s.
That said, I do like the direction we are heading and look forward to seeing what host-your-own hardware we get in 2 years.
At a theoretical 6 tok/s and 86400 seconds in a day, approx 500,000 tokens of GLM 5.2 output for 2 bucks a day seems like a pretty good bargain to me. Of course, that's not counting the one-time cost of the hardware to run it. But I see people dropping $4000-5000 on all kinds of much less useful stuff.
Additionally, in a place where people use electric baseboard heating or electric in-floor radiant heating, or really any other heating-element-based system in winter that's less efficient than a heat pump, the additional electricity used by a computing load is basically "free", since you would be spending that same money otherwise to heat your house. If a computer with 512GB of RAM is dumping its waste heat into your room, it accomplishes a portion of the same thing as a baseboard.
Not to mention there is a whole other less measurable benefit of having a locally hosted model that can't be turned off or arbitrarily restricted by a service provider, and where all of your queries and context cache aren't subject to surveillance by any third party.
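For anyone who wants to check the arithmetic above, here is a rough sketch. It takes the 6 tok/s and $2/day figures from the comment at face value, assumes the machine generates nonstop for 24 hours, and ignores the hardware purchase entirely. The function names are made up for illustration.

```python
# Back-of-envelope: electricity cost per million locally generated tokens.
# Assumptions: 6 tok/s sustained all day, $2/day of electricity,
# hardware cost not included.
TOK_PER_SEC = 6
SECONDS_PER_DAY = 86_400
POWER_COST_PER_DAY = 2.00  # USD

def tokens_per_day(tok_per_sec, seconds=SECONDS_PER_DAY):
    """Tokens produced in one day of uninterrupted generation."""
    return tok_per_sec * seconds

def cost_per_mtok(cost_per_day, tok_per_day):
    """Daily running cost divided by daily output, in USD per million tokens."""
    return cost_per_day / (tok_per_day / 1_000_000)

daily = tokens_per_day(TOK_PER_SEC)
print(daily)                                               # 518400
print(round(cost_per_mtok(POWER_COST_PER_DAY, daily), 2))  # 3.86
```

So under these assumptions the running cost alone works out to a bit under $4 per million output tokens, and any idle time pushes that figure up.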
On OpenRouter, the cheapest GLM 5.2 provider costs $3/MTok (at 44 tps). Assuming most use is output tokens, that's still the equivalent of 450k tokens/day, so we're in the same ballpark, but without the capex for two 3090s and the machine.
Self hosted only makes economic sense if your priority is being in control / avoiding surveillance.
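A quick way to put the API side in the same units. This sketch assumes every token is billed at the flat $3/MTok output rate and that the provider sustains 44 tps, both taken from the comment above; it does not model input tokens or caching, so it will not necessarily reproduce the 450k estimate. The helper names are hypothetical.

```python
# API cost and wall-clock time for a given number of output tokens.
# Assumptions: flat output pricing, steady throughput, no input-token cost.
API_PRICE_PER_MTOK = 3.00  # USD per million output tokens
API_TOK_PER_SEC = 44

def api_cost(tokens, price_per_mtok=API_PRICE_PER_MTOK):
    """USD cost of generating `tokens` output tokens through the API."""
    return tokens / 1_000_000 * price_per_mtok

def hours_to_generate(tokens, tok_per_sec=API_TOK_PER_SEC):
    """Hours needed to stream `tokens` tokens at a constant rate."""
    return tokens / tok_per_sec / 3600

local_day = 6 * 86_400  # tokens a 6 tok/s box produces in 24 hours
print(round(api_cost(local_day), 2))           # 1.56
print(round(hours_to_generate(local_day), 2))  # 3.27
```

In other words, a full day of local output at 6 tok/s would cost roughly a dollar and a half through the API and arrive in a little over three hours.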
A system that draws 600W under max CPU usage on all cores, with RAM and a few 3090-class GPUs, might draw only around 90W when idle at 0.00 Unix load.
If we say: (600 * 24 * 31)/1000 = 446kWh in a month at full load 24 hours a day
But it could be less, such as: (90 * 12 * 31)/1000 = 33.48 kWh of idle time in a month, and 223kWh of "full load" 600W time in a month, if it's at full load only 12 hours a day.
If you're the only user accessing it and you only "use" it 12 hours a day, that cumulative yearly dollar figure would be almost halved. Or even less if a person is using it in bursts and intermittently throughout an 8 hour workday.
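The same arithmetic as a small sketch, so other duty cycles are easy to try. The 600W and 90W figures come from the comment above; the 8-hour case and the function names are my own additions for illustration, and turning kWh into dollars would need your local electricity rate.

```python
# Monthly energy use for a box that is either at full load or idling.
# Assumptions: 600 W under load, 90 W idle, 31-day month.
def kwh(watts, hours):
    """Energy in kWh for a constant draw of `watts` over `hours`."""
    return watts * hours / 1000

def monthly_kwh(load_hours, idle_hours, load_w=600, idle_w=90, days=31):
    """Total kWh per month given daily hours at load and at idle."""
    return kwh(load_w, load_hours * days) + kwh(idle_w, idle_hours * days)

print(round(monthly_kwh(24, 0), 2))   # 446.4  (full load around the clock)
print(round(monthly_kwh(12, 12), 2))  # 256.68 (12 h load, 12 h idle)
print(round(monthly_kwh(8, 16), 2))   # 193.44 (8 h load, 16 h idle)
```

Note that the idle draw keeps the 12-hour case at about 57% of the full-load figure, not exactly half.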
You can’t do that with 6 tps, though.
No, you would pay usage based rates with API, in this case. I have exactly one fixed monthly rate for the 6 AI models I have tokens available for.
It isn't 100% efficient. Even the best PSUs aren't.
There is no "ubiquitous" geothermal where there is also high power usage. Data centers have to go where power is, not where it could be.
[1] https://en.wikipedia.org/wiki/List_of_geothermal_power_stati...
[1] https://www.cnbc.com/2025/03/12/amazon-google-and-meta-suppo...
[2] https://www.sciencenews.org/article/small-modular-nuclear-re...
[3] https://floodlightnews.org/fraud-and-corruption-on-rise-at-u...
[4] https://decarbonization.visualcapitalist.com/animated-70-yea...
There's also tons of opportunity to build them out in former pulp mill towns on Vancouver Island that have big interconnects or dedicated generation.
You'd have to be an idiot to put a datacentre in Vancouver, or have fuck-off scale monopoly money, which is probably why Telus is doing it.
I think the main reason not to run locally is to get the full models instead of quantized versions.
I agree and I prefer on-prem where possible. The Apple Mac Studios have been great for that, although I don't have enough of them to run GLM-5.2 without heavy quantization. I'm also waiting for Apple's next product refresh, which I hope will enable me to do more with less.
Meanwhile there are hosted privacy-conscious options out there. Two names to look at are Tinfoil[1] and Privatemode (from Edgeless Systems)[2].
Tinfoil[1] is, sadly, US-based. EU-sovereignty-option is on their long-term radar. But they do have GLM-5.2 today.
Privatemode[2] is a German company (Edgeless Systems) with EU-based servers. But sadly no GLM-5.2 today, it is on their mid-long term radar though.
Both Tinfoil and Privatemode operate on the same concept of the LLM operating in a secure enclave and you have end-to-end attestation and encryption.
Tinfoil have not been independently audited; an audit is somewhere on their long-term radar.
Privatemode have been thoroughly independently audited with documentation available on request.
Both of them are API-tokens-only. So if you're currently one of those people throwing $200 a month down the pan at Anthropic/OpenAI for a so-called 'unlimited' plan, then neither Tinfoil nor Privatemode will be the place for you.
I have this feeling that it'll be very expensive and still scarce. Normally I wouldn't say this about Apple, because their pricing is part of their brand, but this time the demand (both by data-centers and prosumers) is the force majeure.
I know people usually say that about Apple, but to be fair to them on this occasion they have not hiked up their prices yet because they are clearly at present still under some old deals that they did a good job negotiating.
However, of course, at some point Apple will run out of both inventory and old-pricing manufacturing capacity. Yes, I am fully expecting some sort of price hike like those seen everywhere else. I am not naïve.
When that time comes it will remain a financial calculation, Apple boxes on one side versus hosted-option-costs on another, in relation to my specific use-cases.
Ultimately I still blame the chip-hoarding hyperscalers though. :)
Or a cloud LLM provider might just refuse to sell to you because it doesn't like your passport.
Like buying a new car today and taking on gas, parking, etc, expenses in case the bus route you’re using goes away at some point in the future. It’s not an economic decision, it’s a desire to have the new car dressed up in what-ifs.
Any more tortured metaphors in store for us?
As soon as VRAM prices drop to sanity I'm going to load up, and I couldn't care less about the power draw.
Some parts of the future are absolutely great.
Anyway, I think GLM 5.2 in many ways is not as interesting as DeepSeek V4 series, which uses an even more advanced attention mechanism and can save a lot of memory capacity for KV cache, especially at larger contexts. Which in turn opens up wide batching especially on consumer platforms. GLM doesn't have that, in some ways it feels broadly similar to Kimi 2.6 wrt. the underlying performance architecture. Both are a bit too heavy to run reasonably at full quality on ordinary hardware.
It also has an input image modality, which is a game changer. The cheap Sinofrontier models have generally been lacking in this regard.
Basically, Chinese competition is fierce - DeepSeek set the pricing tier, and the question for each lab now is how to justify charging a little more.
MiMo-2.5-Pro has gone with UltraSpeed, pumping out 1000 t/s for a 3X price hike.
GLM has gone with 5.2, hitting Opus levels of reasoning at a fraction of the cost.
DeepSeek will probably keep their pricing model and just keep getting better and better.
Qwen-3.7 is the dark horse. Some rumours are Alibaba is simply making these models because they need them internally.
The real question is why this level of innovation and competition isn’t happening in America or Europe. In particular I see no reason Europe doesn’t have a lab competing on these terms.
Europe can provide none of this. They will never be at the frontier of AI tech, for the same reason they were never at the frontier of any tech.
I say this as a software engineer from Europe.
Qualify it to software, rather than all tech, if you will.
How about addressing this false dichotomy with the likelihood that someone who is new or interested in a tech isn't willing to drop thousands of dollars on used hardware for a whim or learning exercise.
A 32-core Epyc (Epyc is required for faster memory access) + 32 GB VRAM + 512 GB RAM is stupid expensive nowadays, and in the best case it will just downgrade to "very" expensive at some point in the future.
This makes sense only if 1. one is paranoid about privacy or 2. they have money to smoke or 3. they need to work around cloud model restrictions, AND they have to do it routinely (because if not, a one-shot cloud bare metal setup is way cheaper, faster, and allows more powerful models, thanks to the VRAM on offer).
I did spend stupid money as well and yet, the system is 2x slower than cloud providers for comparable performance on vision tasks (I still have to test coding). Oh, and it's hot as hell.
Can you put up with that? That seems very slow. I aim for 40t/s on a laptop and choose models that deliver that speed over larger, slower ones.
In case it's not clear, you will be generating 10,000,000 a second. Good luck verifying it. Token generation is not the bottleneck for creative work. If you are doing predictable work and have a good workflow and a massive dataset to process, then token speed matters. If you are performing creative work like coding, it doesn't.
Apart from running local models, I use this rig as my main remote development platform. All Claude Code sessions are running there in tmux now. And my fingers couldn't be happier not having to deal with a constantly hot laptop. Not to mention that Claude Code is such a battery hog.
[0] https://medium.com/@rathko/i-built-an-epyc-64-core-512gb-ram...
Or maybe the model itself only runs on the GPUs, and the CPU memory only stores the weights for experts not currently activated? If so, then what are the 32 or 64 CPU cores for?
I'm a big fan of fully utilizing one's hardware, and it's kinda sad that it's not the norm for everyday software to run on GPU, CPU or both, choosing dynamically at runtime.
https://github.com/noonghunna/club-3090/blob/master/docs/DUA...
Cloud offerings are 80-200tk/sec versus single digit tk/sec.
That said, I'm also surprised it runs at all locally. I do think it'd be painfully slow for anything interactive so you're relying on another model for a comprehensive design or you're hoping a one-shot with somewhat degraded quality turns out correctly.
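To make "painfully slow" concrete, here is a rough sketch of how long you wait for one response at the speeds quoted above. The 2,000-token response length is a made-up figure for illustration, 6 tok/s stands in for "single digit", and prompt processing time is ignored.

```python
# Wall-clock wait for a single response at different generation speeds.
# Assumptions: hypothetical 2,000-token response, constant throughput,
# no prompt-processing or network latency.
RESPONSE_TOKENS = 2_000

def wait_seconds(tokens, tok_per_sec):
    """Seconds needed to stream `tokens` tokens at a constant rate."""
    return tokens / tok_per_sec

for label, tps in [("local", 6), ("cloud, low end", 80), ("cloud, high end", 200)]:
    secs = wait_seconds(RESPONSE_TOKENS, tps)
    print(f"{label}: {secs:.0f} s")
# local: 333 s
# cloud, low end: 25 s
# cloud, high end: 10 s
```

Five and a half minutes per response is workable for a batch job you walk away from, much less so for a back-and-forth session.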