by segmondy17 hours ago |

upvote

by effisfor9 hours ago|

[-]

I applaud all you tinkerers for pushing on the state of the home-brewed art here. Like crypto, AI is drowned out by hucksters; very few people talk about developing resilience, or about the researchers who push on open source models in efforts to cram them onto an electric toothbrush or Tamagotchi. Bravo to you all.

reply

upvote

by SwellJoe41 minutes ago|

[-]

6 tokens per second is not fit for interactive use. I find Gemma 4 (QAT 4-bit, MTP) to be tolerable at about 30 tokens per second on my old GPUs. Anything slower than 15 is annoying. I tried DS4 on my Strix Halo (1-bit quantization of DeepSeek V4 Flash, the biggest model that can realistically run on 128GB right now), and it tops out at something like 10 or 11 with a long time to first response, and that's quite painful to use. I'd definitely rather spend money to use the big models on cloud infrastructure.

And, the several thousand dollars it costs to run these things unusably slowly buys a lot of tokens on the cheap Chinese models.

reply

upvote

by discordance12 hours ago|

[-]

Running that full load is at least 600 W, so in a day ~14 kWh. At $0.20 a kWh, that would be $2.80/day or $1k a year of op-ex in electricity.
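
As a rough sketch of that arithmetic, assuming a constant 600 W draw around the clock (the function names here are illustrative, not from any tool mentioned in the thread):

```python
def daily_energy_kwh(power_watts: float, hours_per_day: float = 24.0) -> float:
    """Energy drawn per day, in kWh, at a constant power level."""
    return power_watts * hours_per_day / 1000.0


def electricity_cost(power_watts: float, price_per_kwh: float,
                     hours_per_day: float = 24.0, days: int = 1) -> float:
    """Electricity cost over `days` days at a constant power level."""
    return daily_energy_kwh(power_watts, hours_per_day) * price_per_kwh * days


# Figures from the comment: 600 W around the clock at $0.20/kWh.
print(f"{daily_energy_kwh(600):.1f} kWh/day")
print(f"${electricity_cost(600, 0.20):.2f}/day")
print(f"${electricity_cost(600, 0.20, days=365):.0f}/year")
```

Without rounding, this comes to 14.4 kWh and about $2.88 per day, so the $2.80/day and roughly $1k/year figures hold up as approximations.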

Unless you really want privacy or the fuzzy feeling of owning your own, it's cheaper, more convenient and has much faster tok/s if you pay a hyperscaler.

That said, I do like the direction we are heading and look forward to seeing what host-your-own hardware we get in 2 years.

reply

upvote

by segmondy5 hours ago|

[-]

No one locally runs full load all day. The only way to see that is if you're training. We are talking about inference. I limit my GPUs to 300 W each, and you can limit them down to 200 W. Since not everything is in GPU memory and the bottleneck is between the CPU and system RAM, the GPUs don't even get to spike; I see 160-180 W for each GPU during inference. So redo your calculation. Figure about 6 hrs of daily inference, and we are down to roughly $125 a year. Thanks again for your speculation.

reply

upvote

by walrus0111 hours ago|

[-]

Not everyone lives in a place where electricity is $0.20 a kWh. For instance BC Hydro residential rates are $0.11 (CAD) for the first tier and $0.14 for the second tier of consumption in a month. At current exchange rate $0.14 CAD is $0.099 USD a kWh. Hydro Quebec is even cheaper.

At a theoretical 6 tok/s, 86400 seconds in a day, approx 500,000 tokens of GLM5.2 output for 2 bucks a day seems like a pretty good bargain to me. Of course not counting the one time cost of the hardware to run it. But I see people dropping $4000-5000 on all kinds of much less useful stuff.
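
A minimal sketch of the tokens-per-day estimate, assuming the machine generates continuously at 6 tok/s; the `utilization` parameter and the function names are hypothetical additions for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def tokens_per_day(tokens_per_second: float, utilization: float = 1.0) -> float:
    """Output tokens generated in a day at a given fraction of busy time."""
    return tokens_per_second * SECONDS_PER_DAY * utilization


def cost_per_million_tokens(daily_cost: float, tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Implied running cost per million output tokens."""
    return daily_cost / tokens_per_day(tokens_per_second, utilization) * 1_000_000


print(f"{tokens_per_day(6):,.0f} tokens/day")
print(f"${cost_per_million_tokens(2.0, 6):.2f} per million tokens")
```

At full utilization this gives 518,400 tokens a day, so "approx 500,000" is a fair rounding, and $2 a day works out to a little under $4 per million output tokens.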

Additionally, in a place where people use electric baseboard heating, electric in-floor radiant heating, or really any other heating-element-based system in winter that's less efficient than a heat pump, the additional electrical load from computing is basically "free", since you would be spending that same money otherwise to heat your house. If a computer with 512GB of RAM is dumping the waste heat into your room, it accomplishes a portion of the same thing as a baseboard.

Not to mention there is a whole other less measurable benefit of having a locally hosted model that can't be turned off or arbitrarily restricted by a service provider, and where all of your queries and context cache aren't subject to surveillance by any third party.

reply

upvote

by Incipient2 hours ago|

[-]

Unless the token estimates I get from using Claude are wayyy out, I burn through 5m+ tokens/day, and I'm not doing a lot of time. 500k tokens in a 24h period for $5k of hardware seems quite poor?

reply

upvote

by kristjansson2 hours ago|

[-]

Be sure you compare input tokens to pre-fill rates and output tokens to generation rates.

reply

upvote

by discordance10 hours ago|

[-]

Where I live prices are often higher than 20c/kWh, but let's take your example and halve it (10c/kWh) so it's ~$1.40/day or ~$500/year.

On Openrouter, the cheapest GLM 5.2 provider costs $3/MTok (at 44 tps). Assuming most use is output tokens, that's still the equivalent of 450k tokens/day, so we're in the same ballpark, but without the capex for two 3090s and the machine.
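
The equivalence can be checked with a short sketch, assuming all spend goes to output tokens at the quoted $3/MTok and 44 tps (function names are illustrative):

```python
def tokens_for_budget(budget_dollars: float, price_per_million: float) -> float:
    """How many tokens a budget buys at a given price per million tokens."""
    return budget_dollars / price_per_million * 1_000_000


def hours_to_generate(tokens: float, tokens_per_second: float) -> float:
    """Wall-clock hours needed to generate `tokens` at a given speed."""
    return tokens / tokens_per_second / 3600


daily_tokens = tokens_for_budget(1.40, 3.0)
print(f"{daily_tokens:,.0f} tokens/day for $1.40")
print(f"{hours_to_generate(daily_tokens, 44):.1f} hours at 44 tps")
```

$1.40 buys roughly 467k tokens, close to the 450k quoted, and at 44 tps the provider would deliver them in about three hours rather than a full day.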

Self hosted only makes economic sense if your priority is being in control / avoiding surveillance.

reply

upvote

by walrus0110 hours ago|

[-]

That's true, there are a lot of places where power is considerably more expensive than $0.20 USD/kWh. But also the 600W figure assumes that it's fully loaded 24x7x365.

A system that draws 600W under max CPU usage on all cores, RAM, and a few 3090-class GPUs might draw only around 90W when idle at 0.00 Unix load.

If we say: (600 * 24 * 31)/1000 = 446kWh in a month at full load 24 hours a day

But it could be less, such as: (90 * 12 * 31)/1000 = 33.48 kWh of idle time in a month, and 223kWh of "full load" 600W time in a month, if it's at full load only 12 hours a day.

If you're the only user accessing it and you only "use" it 12 hours a day, that cumulative yearly dollar figure would be almost halved. Or even less if a person is using it in bursts and intermittently throughout an 8 hour workday.
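
The monthly figures above can be reproduced with a small duty-cycle sketch, assuming the machine is either at full load (600 W) or idle (90 W) and nothing in between; the function name is hypothetical:

```python
def monthly_energy_kwh(load_watts: float, idle_watts: float,
                       load_hours_per_day: float, days: int = 31):
    """Return (load_kwh, idle_kwh) for a month split between load and idle."""
    idle_hours_per_day = 24 - load_hours_per_day
    load_kwh = load_watts * load_hours_per_day * days / 1000
    idle_kwh = idle_watts * idle_hours_per_day * days / 1000
    return load_kwh, idle_kwh


full_load, _ = monthly_energy_kwh(600, 90, load_hours_per_day=24)
half_load, half_idle = monthly_energy_kwh(600, 90, load_hours_per_day=12)

print(f"24 h/day load: {full_load:.1f} kWh")
print(f"12 h/day load: {half_load:.1f} kWh load + {half_idle:.2f} kWh idle")
print(f"ratio: {(half_load + half_idle) / full_load:.1%}")
```

With 12 hours of load a day, the total is about 257 kWh against 446 kWh, or roughly 57% of the always-on figure once idle draw is counted.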

reply

upvote

by nearbuy1 hours ago|

[-]

The usage is irrelevant if we're interested in cost per token. If you use it half as much, you get half as many tokens at half the cost. It's still $5.56 in electricity per million output tokens either way (using $0.20/kWh, adjust accordingly if you have cheaper electricity). If you use the API, you also pay half as much if you use half as much.
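
A sketch of where the $5.56 figure comes from, assuming the 600 W and 6 tok/s numbers used elsewhere in the thread (the comment does not restate them, so they are inferred inputs):

```python
def electricity_cost_per_million_tokens(power_watts: float,
                                        tokens_per_second: float,
                                        price_per_kwh: float) -> float:
    """Electricity cost to generate one million tokens at constant power."""
    seconds = 1_000_000 / tokens_per_second
    kwh = power_watts * seconds / 3600 / 1000  # W*s -> Wh -> kWh
    return kwh * price_per_kwh


print(f"${electricity_cost_per_million_tokens(600, 6, 0.20):.2f} per million tokens")
print(f"${electricity_cost_per_million_tokens(600, 6, 0.10):.2f} per million tokens")
```

Hours of use per day never enter the formula, which is the point being made: only power draw, generation speed, and the electricity price set the cost per token.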

reply

upvote

by wqaatwt8 hours ago|

[-]

> person is using it in bursts and intermittently throughout an 8 hour workday.

You can’t do that with 6 tps, though.

reply

upvote

by AbsurdCensor9 hours ago|

[-]

I think that's the biggest difference for most. If you can amortize the hardware costs, then 'burst usage' is cheaper at home to a degree, because you are paying a fixed monthly rate elsewise. Overall though, for most it is likely cheaper to use the cloud than to run at home, but it really depends on what you want.

reply

upvote

by nomel3 hours ago|

[-]

> because you are paying a fixed monthly rate elsewise

No, you would pay usage based rates with API, in this case. I have exactly one fixed monthly rate for the 6 AI models I have tokens available for.

reply

upvote

by re-thc4 hours ago|

[-]

> But also the 600W figure assumes that it's fully loaded 24x7x365.

It isn't 100% efficient. Even the best PSUs aren't.

reply

upvote

by tmountain10 hours ago|

[-]

Lots of people have solar. Green AI, imagine that!

reply

upvote

by cultofmetatron9 hours ago|

[-]

if only there was a magical place where geothermal and hydroelectric is ubiquitous and the weather is cold enough that no one is going to be complaining about free heating.

reply

upvote

by nomel3 hours ago|

[-]

The largest geothermal plant in the world is only 1.5GW, in the United States, which is over double all the plants combined in Iceland. The second largest is 1/3 that, in Mexico. [1]

There is no "ubiquitous" geothermal where there is also high power usage. Data centers have to go where power is, not where it can be.

[1] https://en.wikipedia.org/wiki/List_of_geothermal_power_stati...

reply

upvote

by nomel10 minutes ago|

[-]

Related, it should surprise no-one that the tech giants are interested in nuclear [1], including small reactors [2], rather than waiting for the utility monopolies [3] to raise an arm and actually generate more power [4].

[1] https://www.cnbc.com/2025/03/12/amazon-google-and-meta-suppo...

[2] https://www.sciencenews.org/article/small-modular-nuclear-re...

[3] https://floodlightnews.org/fraud-and-corruption-on-rise-at-u...

[4] https://decarbonization.visualcapitalist.com/animated-70-yea...

reply

upvote

by walrus019 hours ago|

[-]

To be fair, Vancouver is such a magical place in terms of electrical cost, but the cost of living and real estate are otherwise through the roof, with decrepit and nasty (would need $100k in renovations immediately if you're not treating it as a teardown) single family detached homes on the east side of the city selling for 3.2 million.

reply

upvote

by theeyescanner35 minutes ago|

[-]

Yeah there's a reason our datacentres are in Kamloops, cheap housing and a big ass river right next to it. It even gets decently cold in the winter so you can save on cooling.

There's also tons of opportunity to build them out in former pulp mill towns on Vancouver Island that have big interconnects or dedicated generation.

You'd have to be an idiot to put a datacentre in Vancouver, or have fuck-off scale monopoly money, which is probably why Telus is doing it.

reply

upvote

by brailsafe3 hours ago|

[-]

Shhh don't forget we have a water shortage. But it is nice to have electricity wrapped into my relatively cheap basement suite rent ;)

reply

upvote

by fghorow4 hours ago|

[-]

You aren't, perchance, from Iceland, are you?

reply

upvote

by matheusmoreira9 hours ago|

[-]

We do want privacy, and we also want to own the hardware so the US can't just turn it off whenever it feels like it.

I think the main reason not to run locally is to get the full models instead of quantized versions.

reply

upvote

by traceroute667 hours ago|

[-]

> We do want privacy, and we also want to own the hardware so the US can't just turn it off whenever it feels like it.

I agree and I prefer on-prem where possible. The Apple Mac Studios have been great for that although I don't have enough of them to run GLM-5.2 without heavy quantization. I'm also waiting for the Apple next product refresh which I hope will enable me to do more with less.

Meanwhile there are hosted privacy-conscious options out there. Two names to look at are Tinfoil[1] and Privatemode (from Edgeless Systems)[2].

Tinfoil[1] is, sadly, US-based. EU-sovereignty-option is on their long-term radar. But they do have GLM-5.2 today.

Privatemode[2] is a German company (Edgeless Systems) with EU-based servers. But sadly no GLM-5.2 today, it is on their mid-long term radar though.

Both Tinfoil and Privatemode operate on the same concept of the LLM operating in a secure enclave and you have end-to-end attestation and encryption.

Tinfoil have not been independently audited, it is somewhere on their long-term radar.

Privatemode have been thoroughly independently audited with documentation available on request.

Both of them are API-tokens-only. So if you're currently one of those people throwing $200 a month down the pan at Anthropic/OpenAI for a so-called-alleged 'unlimited' plan, then neither Tinfoil nor Privatemode will be the place for you.

[1] https://tinfoil.sh/

[2] https://www.privatemode.ai/

reply

upvote

by patates6 hours ago|

[-]

> Apple next product refresh

I have this feeling that it'll be very expensive and still scarce. Normally I wouldn't say this about Apple, because their pricing is part of their brand, but this time the demand (both by data-centers and prosumers) is the force majeure.

reply

upvote

by traceroute666 hours ago|

[-]

> because their pricing is part of their brand

I know people usually say that about Apple, but to be fair to them on this occasion they have not hiked up their prices yet because they are clearly at present still under some old deals that they did a good job negotiating.

However, of course, at some point Apple will run out of both inventory and old-pricing manufacturing capacity. Yes, I am fully expecting some sort of price-hike like has been seen everywhere else. I am not naïve.

When that time comes it will remain a financial calculation, Apple boxes on one side versus hosted-option-costs on another, in relation to my specific use-cases.

Ultimately I still blame the chip-hoarding hyperscalers though. :)

reply

upvote

by bawana2 hours ago|

[-]

Even on a macStudio w 512 gig memory?

reply

upvote

by SXX11 hours ago|

[-]

I guess you missed the recent news. The problem is that a cloud LLM might just silently sabotage your work by downgrading the output model with no notice.

Or a cloud LLM might just refuse to sell to you because it doesn't like your passport.

reply

upvote

by yorwba11 hours ago|

[-]

So you're buying expensive hardware as insurance for the case that your cloud provider turns against you and you have to switch to another of the twenty offering the same model https://openrouter.ai/z-ai/glm-5.2 or in the worst case buy the same hardware later? How does that make sense?

reply

upvote

by brookst7 hours ago|

[-]

It’s rationalization for what people want to do anyway.

Like buying a new car today and taking on gas, parking, etc, expenses in case the bus route you’re using goes away at some point in the future. It’s not an economic decision, it’s a desire to have the new car dressed up in what-ifs.

reply

upvote

by CamperBob21 hours ago|

[-]

Yes, it is understandable that people who are subject to being kicked off the bus at random times through no fault of their own, or who sometimes find that the bus slows to 8 miles per hour and makes them late for work, or who are tired of arguing with the bus driver who refuses to take them to the liquor store, the casino, or the titty bar, may aspire to own a car, even a crappy one.

Any more tortured metaphors in store for us?

reply

upvote

by swiftcoder11 hours ago|

[-]

This is not really a problem for the open-weight models; you can always give your money to an inference provider in a different jurisdiction.

reply

upvote

by throwawayffffas7 hours ago|

[-]

So in my experience with two 7900XTs and models that sit fully in VRAM, it's more like 400W; the GPUs spend a lot of time waiting for each other.

reply

upvote

by DrScientist5 hours ago|

[-]

Depends on whether you've also gone for self-hosted electricity generation or not.

reply

upvote

by downut2 hours ago|

[-]

I have rooftop solar and I have been building credit with my electric utility even though the daily high temperature is well over 100F outside and a comfortable 75F inside. That includes running three AMD 12 thread 128GB systems with obsolete GPUs 24x7x365. I'm not a gamer, so 6 years ago I went low-end low-power GPUs. Boy am I dumb. Currently running the qwen3.6:27b, 35b, and gemma4:31b models just fine.

As soon as VRAM prices drop to sanity I'm going to load up and I could care less about the power draw.

Some parts of the future are absolutely great.

reply

upvote

by poulpy1238 hours ago|

[-]

Which hyperscaler would you suggest?

reply

upvote

by dzjkb10 hours ago|

[-]

how do you rent 2 3090s for $2.80/day?

reply

upvote

by zozbot23414 hours ago|

[-]

AIUI the llama.cpp implementation for this model is still quite half-baked because it is missing support for the DSA sparse attention mechanism. This leads to running the model with a different mechanism that it has not been trained for, which has been shown to lead to lower quality and performance.

Anyway, I think GLM 5.2 in many ways is not as interesting as DeepSeek V4 series, which uses an even more advanced attention mechanism and can save a lot of memory capacity for KV cache, especially at larger contexts. Which in turn opens up wide batching especially on consumer platforms. GLM doesn't have that, in some ways it feels broadly similar to Kimi 2.6 wrt. the underlying performance architecture. Both are a bit too heavy to run reasonably at full quality on ordinary hardware.

reply

upvote

by trollbridge5 hours ago|

[-]

Particularly DeepSeek 4.1, which they appear to be A/B testing on the API and which also seems available on the free chat interface.

It also has an input image modality, which is a game changer. The cheap Sinofrontier models have generally been lacking in this regard.

Basically, Chinese competition is fierce - DeepSeek set the pricing tier, and the question for each lab now is how to justify charging a little more.

MiMo-2.5-Pro has gone with UltraSpeed, pumping out 1000t/s for a 3X price hike.

GLM has gone with 5.2, hitting Opus levels of reasoning at a fraction of the cost.

DeepSeek will probably keep their pricing model and just keep getting better and better.

Qwen-3.7 is the dark horse. Some rumours are Alibaba is simply making these models because they need them internally.

The real question is why this level of innovation and competition isn’t happening in America or Europe. In particular I see no reason Europe doesn’t have a lab competing on these terms.

reply

upvote

by SalariedSlave4 hours ago|

[-]

Competing and innovating in the fast moving SOTA end of the llm space requires a ruthless disregard for copyright, IP, bureaucracies, formalities, risk assurances and other slowdowns. It requires a risk tolerant, quick and large flowing investment of capital. It requires a scoped focus that is pragmatic and sharp about key concerns, and efficiently dismissive of meaningless details.

Europe can provide none of this. They will never be at the frontier of AI tech, for the same reason they were never at the frontier of any tech.

I say this as a software engineer from Europe.

reply

upvote

by trollbridge1 hours ago|

[-]

I’m not completely convinced that America and China are both lawless free for alls, and that that is what’s required for AI innovation.

reply

upvote

by leansensei3 hours ago|

[-]

Europe was never at the frontier of any tech? Huh what now?

reply

upvote

by SalariedSlave3 hours ago|

[-]

A hyperbole born of frustration, I admit.

Qualify it to software, rather than all tech, if you will.

reply

upvote

by CamperBob21 hours ago|

[-]

Not since the salad days of Nokia. Ancient history at this point.

reply

upvote

by dxuh14 hours ago|

[-]

"All it takes to run" might be fair if you paid $2400, but right now the total price is way closer to $10k (almost 5k for the RAM and 2k each for the GPUs). Today that is a lot of expensive hardware.

reply

upvote

by segmondy13 hours ago|

[-]

512gb 2400mhz ddr4 ram = $1600 not $5000. https://www.ebay.com/itm/188284985172 You can get creative and source 2-3 2080ti 22gb from China for about $250 a piece. You can either be resourceful and find a way or find a whole bunch of excuses.

reply

upvote

by officialchicken11 hours ago|

[-]

> You can either be resourceful and find a way or find a whole bunch of excuses.

How about addressing this false dichotomy with the likelihood that someone who is new or interested in a tech isn't willing to drop thousands of dollars on used hardware for a whim or learning exercise.

reply

upvote

by pizza2348 hours ago|

[-]

LOL, sure this works if one has a time machine or a LOT of money to burn.

32 CPU Epyc (Epyc is required for faster memory access) + 32 GB VRAM + 512 GB RAM is stupid expensive nowadays, and in the best case, it will just downgrade to "very" expensive at some point in the future.

This makes sense only if 1. one is paranoid about privacy or 2. they have money to smoke or 3. they need to work around cloud model restrictions, AND they have to do it routinely (because if not, a one-shot cloud bare metal setup is way cheaper, faster, and allows more powerful models, due to VRAM offering).

I did spend stupid money as well and yet, the system is 2x slower than cloud providers for comparable performance on vision tasks (I still have to test coding). Oh, and it's hot as hell.

reply

upvote

by fsuts15 hours ago|

[-]

6 tokens per second?

Can you put up with that? That seems very slow. I aim for 40t/s on a laptop and choose models that deliver that speed over larger, slower ones.

reply

upvote

by segmondy14 hours ago|

[-]

I have been putting up with it forever. We are spoiled by MixtureOfExperts. Folks were delighted to run llama3-70B at such speed. We were happy with 15-20tk/sec with 8b models, and if you could run llama3-405B at 1tk/sec you were a god. To each their own. I can live with 6 high quality tokens. If I could get a Fable equivalent model, I'll gladly take 2tk/sec if that's what it took to run it locally.

reply

upvote

by manmal14 hours ago|

[-]

But what is it doing for you that you couldn't do yourself at that speed? I'm really curious and on the fence about partly going local.

reply

upvote

by all214 hours ago|

[-]

I think you would use it more like email and less like text messages, so the domain of communication shifts drastically. The other part is, you don't have to run just that model; you can offload a lot of chores to smaller models.

reply

upvote

by AussieWog939 hours ago|

[-]

Not a Local LLM user, but I regularly kick off meaty jobs in Claude Code then check on them 1-2hrs later.

reply

upvote

by wqaatwt8 hours ago|

[-]

In this case it would be 20-40 hours to accomplish the same amount of work when running locally.

reply

upvote

by Mashimo13 hours ago|

[-]

Run one task, while you do another? Or while you sleep / eat / rave?

reply

upvote

by manmal6 hours ago|

[-]

While my colleagues are running 6 parallel agents at 50-100t/s each, with an actual SOTA model? Don't you think I'd get fired after a few weeks of that?

reply

upvote

by nozzlegear2 hours ago|

[-]

Do you work at Facebook and happen to find yourself in a token burning competition with your colleagues?

reply

upvote

by nijave6 hours ago|

[-]

I agree single digit tk/sec is painfully slow, but I also doubt anyone with these local/homelab setups is using them for work. Likely fire off and check back later. That said, I've had terrible results one-shotting, so you'd need to design with a faster model or have extreme patience during the discovery/design phase.

reply

upvote

by Mashimo5 hours ago|

[-]

Why would you use this when your company has access to actual SOTA? I don't get it.

reply

upvote

by segmondy4 hours ago|

[-]

Here's a thought experiment for you. Let's say you can run 1000 agents at 10,000 tokens a second. Do you think you are going to be more productive than someone running at 6tk/sec with the same model?

In case it's not clear, you will be generating 10,000,000 tokens a second. Good luck verifying it. Token generation is not the bottleneck for creative work. If you are doing predictable work and have a good workflow and a massive dataset to process, then token speed matters. If you are performing creative work like coding, it doesn't.

reply

upvote

by froh14 hours ago|

[-]

do you use caveman or similar?

reply

upvote

by walrus0111 hours ago|

[-]

I get a lot done with something that's also approximately 6 tokens/second, if you're willing to give it a well defined set of prompts and projects to work on, leave it for an hour or two, then come back and check what it's done. And often to remember to give it something of more consequence to do for at least 3-4 hours of wall clock runtime before heading to bed.

reply

upvote

by radku10 hours ago|

[-]

I have pretty much almost this exact setup with 2x3090s and with slightly faster DDR4 512GB and 64 core Epyc! [0] I've been enjoying it a lot. Can't wait to give this model a try.

Apart from running local models, I use this rig as my main remote development platform. All Claude Code sessions are running there in tmux now. And my fingers couldn't be happier not having to deal with a constantly hot laptop. Not to mention that Claude Code is such a battery hog.

[0] https://medium.com/@rathko/i-built-an-epyc-64-core-512gb-ram...

reply

upvote

by nextaccountic15 hours ago|

[-]

How can you combine CPU cores and multiple GPUs? Are you running some layers on the CPU, others on GPU #1, and others on GPU #2? What about the bandwidth and latency between them?

Or maybe the model itself only runs on the GPUs, and the CPU memory only stores the weights for experts not currently activated? If so, then what are the 32 or 64 CPU cores for?

I'm a big fan of fully utilizing one's hardware and it's kinda sad that it's not the norm to run things on either gpu, cpu or both, dynamically choosing at runtime, for everyday software

reply

upvote

by nodja14 hours ago|

[-]

Pipeline parallelism. Instead of splitting layers by row/column, you split at the layer edges. So instead of having this huge bottleneck of bandwidth, you only need to transfer about 4KB per token when changing devices on a model like Qwen3 30B-A3B.
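
A back-of-the-envelope sketch of that 4KB figure, under two assumptions not stated in the comment: a hidden size of 2048 for the model and 16-bit activations. Only one hidden-state vector per token has to cross each device boundary:

```python
def boundary_bytes_per_token(hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes sent across one pipeline boundary for a single token."""
    return hidden_size * bytes_per_value


def boundary_bytes_per_second(hidden_size: int, tokens_per_second: float,
                              bytes_per_value: int = 2) -> float:
    """Sustained traffic across one boundary during token generation."""
    return boundary_bytes_per_token(hidden_size, bytes_per_value) * tokens_per_second


HIDDEN_SIZE = 2048  # assumed hidden size; check the model's config to confirm

print(boundary_bytes_per_token(HIDDEN_SIZE), "bytes per token")
print(boundary_bytes_per_second(HIDDEN_SIZE, 6) / 1024, "KiB/s at 6 tok/s")
```

Under those assumptions the traffic during generation is tens of kilobytes per second, which is negligible next to what even a slow PCIe link can carry.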

reply

upvote

by xrd13 hours ago|

[-]

This is a good place to start reading about dual gpus.

https://github.com/noonghunna/club-3090/blob/master/docs/DUA...

reply

upvote

by nextaccountic13 hours ago|

[-]

But in this case he used a cpu too

reply

upvote

by segmondy14 hours ago|

[-]

check out llama.cpp, the entire point of the project is to serve us mere mortals and the GPU poor.

reply

upvote

by edg500015 hours ago|

[-]

Very cool. So it's not just about GPU VRAM, which I incorrectly thought. I thought you'd need 512 GB GPU VRAM. I don't think it cost only 2400; 512GB ram would be more expensive though back in the day. But not mortgage-grade 200.000 which I estimated myself (which assumed running in 100% VRAM; overkill for a single user probably).

reply

upvote

by segmondy14 hours ago|

[-]

you can use system ram with a system like llama.cpp which offloads to system ram. token generation is a function of system memory bandwidth, the faster the bandwidth the better. so I'm on 8 channel 2400mhz. if I had 12 ddr channels, I would get 1.5x the speed at 2400mhz. of course ddr5 is much faster, so 12 channels at 4800mhz would provide 3x the speed for token generation, or roughly 18tk/sec. prompt processing is all about compute, so the more cpu cores you have the faster it can do PP.
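
A rough sketch of that scaling, assuming token generation is purely memory-bandwidth bound, that "2400mhz" means 2400 MT/s, and that each channel is 64 bits (8 bytes) wide. These are theoretical peak numbers, and real sustained bandwidth is lower:

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int,
                       bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


def scaled_tokens_per_second(baseline_tps: float, baseline_gbs: float,
                             new_gbs: float) -> float:
    """Scale generation speed linearly with memory bandwidth."""
    return baseline_tps * new_gbs / baseline_gbs


current = peak_bandwidth_gbs(8, 2400)     # 8-channel DDR4-2400
wider = peak_bandwidth_gbs(12, 2400)      # 12 channels, same speed
ddr5 = peak_bandwidth_gbs(12, 4800)       # 12 channels at 4800 MT/s

print(f"{current:.1f} GB/s -> 6.0 tok/s (measured baseline)")
print(f"{wider:.1f} GB/s -> {scaled_tokens_per_second(6, current, wider):.1f} tok/s")
print(f"{ddr5:.1f} GB/s -> {scaled_tokens_per_second(6, current, ddr5):.1f} tok/s")
```

The ratios come out to 1.5x and 3x, matching the 18 tk/sec estimate, though the linear model ignores whatever share of the work the GPUs handle.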

reply

upvote

by nijave6 hours ago|

[-]

Well, it's about GPU VRAM if you want something competitive with cloud-hosted offerings at the performance levels shown in benchmarks. This is a heavy quant with quality degradation and significantly lower performance.

Cloud offerings are 80-200tk/sec versus single digit tk/sec.

That said, I'm also surprised it runs at all locally. I do think it'd be painfully slow for anything interactive so you're relying on another model for a comprehensive design or you're hoping a one-shot with somewhat degraded quality turns out correctly.

reply

upvote

by edg50005 hours ago|

[-]

I see. So not quite usable apart from specific use cases. Maybe in a few years we'll see new hardware players and better prices.

reply

upvote

by ikari_pl4 hours ago|

[-]

I can work out max 90GB to the agents. Advise. :)

reply

upvote

by redox9917 hours ago|

[-]

That's crazy good for $2400.

reply