No one runs their local setup at full load all day. The only way you'd see that is if you're training, and we are talking about inference. I limit my GPUs to 300 watts, and you can limit them down to 200W. Since not everything fits in the GPUs, the bottleneck is between the CPU and system RAM, so the GPUs don't even get to spike: I see 160-180W per GPU during inference. So redo your calculation. Figure about 6 hours of daily inference, and we are down to roughly $125 a year. Thanks again for your speculation.
reply
Not everyone lives in a place where electricity is $0.20 a kWh. For instance, BC Hydro residential rates are $0.11 (CAD) for the first tier and $0.14 for the second tier of consumption in a month. At the current exchange rate, $0.14 CAD is $0.099 USD a kWh. Hydro Quebec is even cheaper.

At a theoretical 6 tok/s and 86,400 seconds in a day, approximately 500,000 tokens of GLM-5.2 output for 2 bucks a day seems like a pretty good bargain to me. Of course, that's not counting the one-time cost of the hardware to run it. But I see people dropping $4000-5000 on all kinds of much less useful stuff.
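A quick sketch of that arithmetic, assuming a steady 6 tok/s and a system drawing about 600W around the clock (the 600W figure is the one discussed further down the thread; the function names are just for illustration):

```python
SECONDS_PER_DAY = 86_400


def tokens_per_day(tok_per_s: float) -> float:
    """Output tokens generated if the box decodes nonstop for 24 hours."""
    return tok_per_s * SECONDS_PER_DAY


def daily_energy_cost(watts: float, price_per_kwh: float, hours: float = 24.0) -> float:
    """Electricity cost of drawing `watts` for `hours` at `price_per_kwh`."""
    kwh = watts / 1000 * hours
    return kwh * price_per_kwh


print(tokens_per_day(6))                       # 518400
print(round(daily_energy_cost(600, 0.14), 2))  # 2.02 at $0.14/kWh
```

That's where "approximately 500,000 tokens for 2 bucks a day" comes from; swap in your own wattage and rate.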

Additionally, in places where people use electric baseboard heating, electric in-floor radiant heating, or really any other resistive heating-element system in winter that's less efficient than a heat pump, the additional electricity from a computing load is basically "free", since you would otherwise be spending that same money to heat your house. If a computer with 512GB of RAM is dumping its waste heat into your room, it accomplishes a portion of the same thing as a baseboard heater.
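A rough way to put numbers on the heating argument, as a sketch: assume every kWh the computer draws ends up as heat in a room you actually want heated, and that the heater it displaces has a coefficient of performance (COP) of 1.0 for resistive baseboards or, as an illustrative assumption, about 3 for a heat pump. The 14.4 kWh input is just 600W for 24 hours.

```python
def net_heating_cost(compute_kwh: float, usd_per_kwh: float, heater_cop: float) -> float:
    """Net cost of running compute when its waste heat displaces space heating.

    Assumes every kWh the computer draws becomes useful heat, and that the
    displaced heater would have needed compute_kwh / heater_cop of electricity
    to deliver the same amount of heat.
    """
    displaced_kwh = compute_kwh / heater_cop
    return (compute_kwh - displaced_kwh) * usd_per_kwh


print(net_heating_cost(14.4, 0.14, heater_cop=1.0))            # 0.0  (resistive baseboard)
print(round(net_heating_cost(14.4, 0.14, heater_cop=3.0), 2))  # 1.34 (heat pump, assumed COP 3)
```

With resistive heat the offset is complete; with a heat pump only about a third of the cost is offset, which matches the "less efficient than a heat pump" caveat.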

Not to mention there is a whole other less measurable benefit of having a locally hosted model that can't be turned off or arbitrarily restricted by a service provider, and where all of your queries and context cache aren't subject to surveillance by any third party.

reply
Unless the token estimates I get from using Claude are way off, I burn through 5M+ tokens/day, and I'm not even using it for that many hours. 500k tokens in a 24h period for $5k of hardware seems quite poor?
reply
Be sure you compare input tokens to prefill rates and output tokens to generation rates.
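To make that concrete, here's a small sketch that splits a day's workload into prompt tokens (processed at the prefill rate) and generated tokens (processed at the generation rate). The 95/5 input/output split and the 200 tok/s prefill rate are hypothetical numbers for illustration, not measurements of any particular setup.

```python
def hours_needed(input_tokens: float, output_tokens: float,
                 prefill_tok_s: float, decode_tok_s: float) -> float:
    """Wall-clock hours to process a workload.

    Prompt (input) tokens are consumed at the prefill rate; generated
    (output) tokens are produced at the much slower decode rate.
    """
    seconds = input_tokens / prefill_tok_s + output_tokens / decode_tok_s
    return seconds / 3600


# Hypothetical day: 5M tokens total, 95% of them input.
print(round(hours_needed(4_750_000, 250_000, prefill_tok_s=200, decode_tok_s=6), 1))  # 18.2
```

Under these assumptions the 250k output tokens alone take more than 11 of those hours, so a "5M tokens/day" figure can't be compared directly against 6 tok/s generation.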
reply
Where I live, prices are often higher than 20c/kWh, but let's take your example and halve it (10c/kWh), so it's ~$1.40/day or ~$500/year.

On OpenRouter, the cheapest GLM-5.2 provider costs $3/MTok (at 44 tps). Assuming most use is output tokens, $1.40/day still buys the equivalent of ~450k tokens/day, so we're in the same ballpark, but without the capex for two 3090s and the machine.
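The comparison above as a sketch: how many API tokens a day's electricity budget would buy at $3 per million tokens, next to what a local box produces at 6 tok/s. It assumes a single flat per-token price (providers usually price input and output separately) and ignores hardware capex entirely.

```python
def api_tokens_per_day(daily_budget_usd: float, usd_per_mtok: float) -> float:
    """Tokens a daily budget buys at a flat price per million tokens."""
    return daily_budget_usd / usd_per_mtok * 1_000_000


def local_tokens_per_day(tok_per_s: float, hours: float = 24.0) -> float:
    """Tokens a local machine generates running at tok_per_s for `hours`."""
    return tok_per_s * 3600 * hours


# ~600W for 24h at $0.10/kWh is about $1.44/day of electricity.
budget = 600 / 1000 * 24 * 0.10
print(round(api_tokens_per_day(budget, 3.0)))  # 480000
print(round(local_tokens_per_day(6)))          # 518400
```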

Self hosted only makes economic sense if your priority is being in control / avoiding surveillance.

reply
That's true, there are a lot of places where power is considerably more expensive than $0.20 USD/kWh. But also the 600W figure assumes that it's fully loaded 24x7x365.

A system that draws 600W at max usage on all CPU cores, the RAM, and a few 3090-class GPUs might draw only around 90W when idle at a Unix load of 0.00.

At full load 24 hours a day: (600 * 24 * 31)/1000 = 446.4 kWh in a month.

But it could be less. If it's at full load only 12 hours a day, that's (600 * 12 * 31)/1000 = 223.2 kWh of full-load time plus (90 * 12 * 31)/1000 = 33.48 kWh of idle time in a month.
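The same month modeled as a sketch that mixes load and idle hours; the 600W load and 90W idle figures are the rough estimates from the comment, not measurements:

```python
def monthly_kwh(load_w: float, load_hours_per_day: float,
                idle_w: float, days: int = 31) -> float:
    """Energy for a month where the box runs at load_w for part of each
    day and idles at idle_w for the rest of it."""
    idle_hours = 24 - load_hours_per_day
    wh_per_day = load_w * load_hours_per_day + idle_w * idle_hours
    return wh_per_day * days / 1000


print(monthly_kwh(600, 24, 90))            # 446.4
print(round(monthly_kwh(600, 12, 90), 2))  # 256.68
```

Splitting the day 12/12 lands at a bit under 60% of the always-on figure, since the idle draw never drops to zero.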

If you're the only user accessing it and you only "use" it 12 hours a day, that cumulative yearly dollar figure would be almost halved. Or even less if a person is using it in bursts and intermittently throughout an 8 hour workday.

reply
The usage is irrelevant if we're interested in cost per token. If you use it half as much, you get half as many tokens at half the cost. It's still $5.56 in electricity per million output tokens either way (using $0.20/kWh, adjust accordingly if you have cheaper electricity). If you use the API, you also pay half as much if you use half as much.
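A sketch of why usage cancels out: hours appear in both the energy cost and the token count, so the ratio depends only on power draw, electricity price, and generation speed. It assumes the 600W draw and 6 tok/s from upthread and, like the comment, ignores idle draw between bursts.

```python
def usd_per_million_tokens(watts: float, usd_per_kwh: float,
                           tok_per_s: float, hours: float = 24.0) -> float:
    """Electricity cost per million generated tokens over `hours` of generation."""
    cost = watts / 1000 * hours * usd_per_kwh
    tokens = tok_per_s * 3600 * hours
    return cost / tokens * 1_000_000


for hours in (24, 12, 6):
    print(hours, round(usd_per_million_tokens(600, 0.20, 6, hours), 2))  # 5.56 each time
```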
reply
> person is using it in bursts and intermittently throughout an 8 hour workday.

You can’t do that with 6 tps, though.

reply
I think that's the biggest difference for most. If you can amortize the hardware costs, then 'burst usage' is cheaper at home to a degree, because you are paying a fixed monthly rate elsewise. Overall, though, for most people it is likely cheaper to use the cloud than to run at home, but it really depends on what you want.
reply
> because you are paying a fixed monthly rate elsewise

No, you would pay usage based rates with API, in this case. I have exactly one fixed monthly rate for the 6 AI models I have tokens available for.

reply
> But also the 600W figure assumes that it's fully loaded 24x7x365.

It isn't 100% efficient. Even the best PSUs aren't.
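For reference, the draw at the wall is the DC load divided by the PSU's efficiency. A sketch, assuming the 600W is what the components pull on the DC side and a 90% efficient supply (real efficiency varies with load, so this is only a ballpark):

```python
def wall_watts(dc_load_w: float, psu_efficiency: float) -> float:
    """Power pulled from the outlet to deliver dc_load_w through the PSU."""
    if not 0 < psu_efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return dc_load_w / psu_efficiency


print(round(wall_watts(600, 0.90)))  # 667
```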

reply
Lots of people have solar. Green AI, imagine that!
reply
If only there were a magical place where geothermal and hydroelectric power are ubiquitous and the weather is cold enough that no one is going to complain about free heating.
reply
The largest geothermal plant in the world is only 1.5GW, in the United States, which is over double all the plants combined in Iceland. The second largest is 1/3 that, in Mexico. [1]

There is no "ubiquitous" geothermal where there is also high power usage. Data centers have to go where the power is, not where it could be.

[1] https://en.wikipedia.org/wiki/List_of_geothermal_power_stati...

reply
Related, it should surprise no-one that the tech giants are interested in nuclear [1], including small reactors [2], rather than waiting for the utility monopolies [3] to raise an arm and actually generate more power [4].

[1] https://www.cnbc.com/2025/03/12/amazon-google-and-meta-suppo...

[2] https://www.sciencenews.org/article/small-modular-nuclear-re...

[3] https://floodlightnews.org/fraud-and-corruption-on-rise-at-u...

[4] https://decarbonization.visualcapitalist.com/animated-70-yea...

reply
To be fair, Vancouver is such a magical place in terms of electricity cost, but the cost of living and real estate are otherwise through the roof: decrepit, nasty single-family detached homes on the east side of the city (ones that would need $100k in renovations immediately if you're not treating them as teardowns) are selling for $3.2 million.
reply
Yeah there's a reason our datacentres are in Kamloops, cheap housing and a big ass river right next to it. It even gets decently cold in the winter so you can save on cooling.

There's also tons of opportunity to build them out in former pulp mill towns on Vancouver Island that have big interconnects or dedicated generation.

You'd have to be an idiot to put a datacentre in Vancouver, or have fuck-off scale monopoly money, which is probably why Telus is doing it.

reply
Shhh, don't forget we have a water shortage. But it is nice to have electricity wrapped into my relatively cheap basement suite rent ;)
reply
You aren't, perchance, from Iceland, are you?
reply
We do want privacy, and we also want to own the hardware so the US can't just turn it off whenever it feels like it.

I think the main reason not to run locally is to get the full models instead of quantized versions.

reply
> We do want privacy, and we also want to own the hardware so the US can't just turn it off whenever it feels like it.

I agree and I prefer on-prem where possible. The Apple Mac Studios have been great for that although I don't have enough of them to run GLM-5.2 without heavy quantization. I'm also waiting for the Apple next product refresh which I hope will enable me to do more with less.

Meanwhile there are hosted privacy-conscious options out there. Two names to look at are Tinfoil[1] and Privatemode (from Edgeless Systems)[2].

Tinfoil[1] is, sadly, US-based. An EU-sovereignty option is on their long-term radar. But they do have GLM-5.2 today.

Privatemode[2] is a German company (Edgeless Systems) with EU-based servers. Sadly there's no GLM-5.2 today, though it is on their mid-to-long-term radar.

Tinfoil and Privatemode both work on the same concept: the LLM runs in a secure enclave, and you get end-to-end attestation and encryption.

Tinfoil have not been independently audited; that is somewhere on their long-term radar.

Privatemode have been thoroughly independently audited with documentation available on request.

Both of them are API-tokens-only. So if you're currently one of those people throwing $200 a month down the pan at Anthropic/OpenAI for a so-called-alleged 'unlimited' plan, then neither Tinfoil nor Privatemode will be the place for you.

[1] https://tinfoil.sh/

[2] https://www.privatemode.ai/

reply
> Apple next product refresh

I have this feeling that it'll be very expensive and still scarce. Normally I wouldn't say this about Apple, because their pricing is part of their brand, but this time the demand (both by data-centers and prosumers) is the force majeure.

reply
> because their pricing is part of their brand

I know people usually say that about Apple, but to be fair to them, on this occasion they have not raised their prices yet, clearly because they are still operating under some old deals that they did a good job negotiating.

However, of course, at some point Apple will run out of both inventory and old-pricing manufacturing capacity. Yes, I am fully expecting some sort of price hike like the ones seen everywhere else. I am not naïve.

When that time comes it will remain a financial calculation, Apple boxes on one side versus hosted-option-costs on another, in relation to my specific use-cases.

Ultimately I still blame the chip-hoarding hyperscalers though. :)

reply
Even on a Mac Studio with 512GB of memory?
reply
I guess you missed the recent news. The problem is that a cloud LLM provider might silently sabotage your work by downgrading the model behind its output with no notice.

Or a cloud LLM provider might just refuse to sell to you because it doesn't like your passport.

reply
So you're buying expensive hardware as insurance for the case that your cloud provider turns against you and you have to switch to another of the twenty offering the same model https://openrouter.ai/z-ai/glm-5.2 or in the worst case buy the same hardware later? How does that make sense?
reply
It’s rationalization for what people want to do anyway.

Like buying a new car today and taking on gas, parking, and other expenses in case the bus route you’re using goes away at some point in the future. It’s not an economic decision; it’s a desire to have the new car, dressed up in what-ifs.

reply
Yes, it is understandable that people who are subject to being kicked off the bus at random times through no fault of their own, or who sometimes find that the bus slows to 8 miles per hour and makes them late for work, or who are tired of arguing with the bus driver who refuses to take them to the liquor store, the casino, or the titty bar, may aspire to own a car, even a crappy one.

Any more tortured metaphors in store for us?

reply
This is not really a problem for the open-weight models; you can always give your money to an inference provider in a different jurisdiction.
reply
In my experience with two 7900 XTs running models that sit fully in VRAM, it's more like 400W; the GPUs spend a lot of time waiting for each other.
reply
Depends on whether you've also gone for self-hosted electricity generation or not.
reply
I have rooftop solar, and I have been building credit with my electric utility even though the daily high temperature is well over 100F outside and a comfortable 75F inside. That's while running three 12-thread AMD systems with 128GB each and obsolete GPUs 24x7x365. I'm not a gamer, so 6 years ago I went with low-end, low-power GPUs. Boy, was I dumb. I'm currently running the qwen3.6:27b, 35b, and gemma4:31b models just fine.

As soon as VRAM prices drop back to sane levels I'm going to load up, and I couldn't care less about the power draw.

Some parts of the future are absolutely great.

reply
Which hyperscaler would you suggest?
reply
How do you rent two 3090s for $2.80/day?
reply