If we didn't have a RAM/GPU shortage right now they would be more nervous than they are. But as it is, very few people are going to be able to afford a rig that can run this model effectively. That's probably not going to change for several more years yet. I think if the Z.ai folks decide to come out with a flash version of GLM-5.2 specialized for coding that came in at about 80B params, then the US frontier labs would probably be more worried. Overall, the Chinese AI companies have been showing the way to do the same amount with less (sometimes much less), and as that trend continues it's going to make the frontier labs worried. But even the Chinese AI companies are going to want to protect their moat by not releasing capable models that are significantly smaller than their current flagship models. Alibaba Qwen seems to be there now: it's gotten mighty quiet from them lately, their latest 395B model is just too large for most folks to run at home, and they don't seem to be making any noises about releasing smaller ones this time around.
reply
The ram/gpu shortage won't last forever though. Moreover, we can be pretty confident that long-term prices will obey Wright's law and come down significantly (from the pre-shortage prices) as we learn to produce them more efficiently.
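
For anyone unfamiliar with Wright's law: it says unit cost falls by a roughly constant fraction every time cumulative production doubles. A minimal sketch of that relationship follows. The function name and the 20% learning rate are illustrative assumptions, not figures for RAM or GPUs.

```python
import math


def wright_unit_cost(first_unit_cost: float, cumulative_units: float,
                     learning_rate: float) -> float:
    """Unit cost after `cumulative_units` have been produced.

    `learning_rate` is the fractional cost drop per doubling of cumulative
    output (0.2 means each doubling cuts unit cost by 20%).
    """
    if cumulative_units < 1 or not 0 <= learning_rate < 1:
        raise ValueError("need cumulative_units >= 1 and 0 <= learning_rate < 1")
    exponent = math.log2(1 - learning_rate)  # negative for any real learning
    return first_unit_cost * cumulative_units ** exponent


# Hypothetical numbers, chosen only to show the shape of the curve.
for doublings in range(4):
    units = 2 ** doublings
    print(units, round(wright_unit_cost(100.0, units, 0.2), 2))
```

The point of the curve is that the savings come from cumulative volume, not from elapsed time, so how fast prices fall depends on how fast production actually ramps.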

LLM companies are valued as if they're going to have some enduring monopoly that they can extract money from... GLM-5.2 and similar models make that valuation very very questionable.

reply
> The ram/gpu shortage won't last forever though.

No disagreement there, but it could easily last another 3 to 5 years which is a long time in tech terms.

reply
Long enough for them to IPO and all the execs to retire. I doubt they care beyond the IPO.
reply
I think this is the play
reply
> The ram/gpu shortage won't last forever though

Don't underestimate the market's ability to remain irrational

reply
the companies which have the power to alleviate these shortages are the same companies who are profiting most from the shortage. scarcity is an asset, it's not irrational that a concentrated market will produce more of that asset.
reply
The solution for high prices is high prices.

If making RAM and SSDs is now cause for a 10 figure valuation, after enough time somebody will dive in.

reply
What's the irrational part? There's sky high demand.
reply
maybe the irrational part is the amount of demand for consumer hardware. wouldn't the market for hardware used for professional ML/AI move away from consumer hardware over time? (I can talk more about what I mean by consumer hardware)

Also irrational parts of this market (would love to hear your thoughts):

- companies buying hardware that isn't power efficient or doesn't give an ROI for ML/AI use cases, and who would be priced out of using that hardware over time

- many people and companies are buying the hardware due to hype and scarcity/FOMO reasons over rational reasons

reply
Very few people, but quite a lot of companies especially after per token pricing took effect and companies see their invoices skyrocketing. You pay an upfront cost once and you’re done.
reply
is it possible that AI companies ordered a bunch of RAM just so that models cannot be run locally? they are betting new fabs won't be built before quantum takes hold.
reply
When a large open weight model is released, a lab, startup, or a rich hobbyist can easily do logit-level distillation and create an XXB param model or whatever, and in theory you should get a really good distill.
reply
I suspect the time horizon is shorter because of software advances. We are getting more capability out of smaller models

Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)

reply
> Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)

True, Qwen3.6-27B is amazing for its size. However, it seems likely that we're not going to see any more of these smaller models from Alibaba/Qwen since several key players exited that organization a few months back.

reply
Do we know where those key players went?
reply
Good to know. I think the trend is clear based on the models coming out of China, and we'll see more capabilities in smaller, more efficient models.
reply
I don't think so. I could easily see a company deciding to host and run these models for their own development. If you have a dev team of about 10 people, a one time $50k investment in an LLM server has to be pretty tempting. Unlimited tokens, decent performance, upgrade options, and potential product integrations.

For companies wanting LLMs in their products in general, I have to think going the local LLM route is even more tempting. Somewhat dumb models are more than good enough for a lot of the things people are integrating LLMs into their products for.

reply
Surely for most the desire is just an LLM provider that doesn't store or sell their queries (including by national actors). As long as that is allowed to happen, surely it's the answer for the vast majority.
reply
> LLM provider that doesn't store or sell their queries

> As long as that is allowed to happen

It won't be. Only we can provide that, and only for ourselves.

reply
Where is $50k coming from again?
reply
That’s less than the monthly salary of 10 software engineers, and assuming they pay API prices, probably earns itself back in about a year.
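
The "earns itself back in about a year" claim can be turned around into the API spend it implies. A rough sketch, using only the thread's figures ($50k, 10 developers, about 12 months) and ignoring electricity, maintenance, and depreciation, which are not given here. The function names are made up for illustration.

```python
def payback_months(hardware_cost: float, monthly_api_spend: float) -> float:
    """Months until avoided API spend equals the upfront hardware cost."""
    if monthly_api_spend <= 0:
        raise ValueError("monthly_api_spend must be positive")
    return hardware_cost / monthly_api_spend


def breakeven_monthly_spend(hardware_cost: float, months: float) -> float:
    """Monthly API spend needed for the hardware to pay off in `months`."""
    if months <= 0:
        raise ValueError("months must be positive")
    return hardware_cost / months


team_spend = breakeven_monthly_spend(50_000, 12)  # whole team, per month
per_dev_spend = team_spend / 10                   # 10 developers
print(round(per_dev_spend, 2))                    # 416.67
```

So the one-year payback only holds if each developer would otherwise burn a bit over $400 a month in API tokens. Lighter users push the payback out proportionally.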

Having said that, I don’t think it’s all that tempting for companies at all, considering this whole market is developing rapidly and it’s nearly impossible to predict where we’ll be at in a year or two.

reply
The hardware requirements aren't evolving and the local models have only been improving.

It's not like you'd lose capabilities, if anything this solution just gets better with time.

reply
If the newer models require more/better hardware then you’ll lose capabilities.

I think you’re better off renting GPU instances and running all the software on those. It’ll be cheaper than Anthropic and OpenRouter but slightly more expensive than electricity and depreciation of hardware.

reply
The newer models don't require more/better hardware. There's a small army of local LLM enthusiasts who are running LLMs using 3090s and H100s because they have lots of memory. Them being old isn't really that big of an issue, as the compute power needed is relatively low, all things considered.

The number of parameters needed for these open weight models has mostly stabilized so the actual memory requirements aren't likely to change all that much.

reply
Correct. The main bottleneck with LLM inference is, and has always been, memory bandwidth.

TPS ≈ your memory bandwidth in GB/s / active weights in GB.

That’s it for decode. That’s all.
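
To make the decode rule of thumb concrete: every generated token has to read the active weights once, so tokens per second is capped at memory bandwidth divided by the size of those weights. A minimal sketch, assuming a single stream and ignoring KV-cache reads and batching. The 32B active parameters and 1000 GB/s are hypothetical values, not specs of any particular model or card.

```python
def active_weights_gb(active_params_billion: float, bits_per_weight: float) -> float:
    """Size in GB of the weights that must be read for each generated token."""
    return active_params_billion * bits_per_weight / 8


def decode_tps_upper_bound(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode tokens/sec when memory bandwidth is the bottleneck."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Hypothetical figures, chosen only to show the arithmetic.
weights = active_weights_gb(active_params_billion=32, bits_per_weight=4)  # 16.0 GB
print(decode_tps_upper_bound(bandwidth_gb_s=1000, weights_gb=weights))    # 62.5
```

This is a ceiling, not a prediction. Real throughput lands below it, and prompt processing (prefill) is compute-bound rather than bandwidth-bound.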

reply
$50K seems low if you want to run, say, GLM 5.2 4bit fast enough for a team of devs.

You need something like 6x RTX Pro 6000 at $11800 each plus a nice server (add $10000) = $80800 and then quite a bit of electricity.

reply
You don't need all of the model in VRAM. 1 or 2 RTX Pro 6000s will do. $50K will get you there very nicely, and on a 1600 watt PSU if you go for the MAX-Q versions. (The same wattage PSU I'm typing this on, and have been using over the last 5 years.)
reply
As in who pays for it or how did I arrive at that number?

For who pays for it, obviously the employer would.

For "how did I arrive at this number": ballpark estimate from what I know about part costs. About $5k would go to the motherboard, CPU, power supply, etc. Most of the money, about $45k, would be for as much RAM and as big/expensive Nvidia cards as you can get your hands on. The B300 has 288GB of VRAM in it. Probably what you'd be after.

reply
You don't even need to run them locally for them to be a threat. Plenty of companies are looking at paying third party companies to host these models and they come in at fractions of the price of the frontier labs.
reply
The RAM requirements are still pretty painful.
reply
equilibrium in one or two more years on the consumer/prosumer side

think Apple M6 or M7 with a currently unforeseen denser memory style, 256gb RAM

a couple inference or cache improvements on the algorithmic side, using less ram for context windows and doubling token speed again

denser open source models, packing more experts for smaller active layers

it'll still be expensive but like $8,000 - $13,000 instead of $450,000 worth of B200s

reply
Fairly certain that model sizes and computational requirements will grow as the price for LLM compute drops.
reply
Maybe there's a conversation to be had about how much is enough... Unless something beyond my imagination happened, I would be happy enough with Opus 4.5 levels of productivity
reply
This really sounds like “640kb should be enough”.

I’m sorry, but I just can’t imagine us running smaller models than we are using right now in 5-10 years from now.

reply
We've already hit RAM power and size limits (about 40k electrons which is the limit before we get noise messing up the amplifier).

If a model needs 2x more memory, but serves the same number of customers, the cost is going to go up per customer to cover the increased hardware and power costs. Companies are starting to implement AI limits to keep costs under control.

Anthropic and OpenAI are rumored to be considering cutting inference prices trying to retain customers as LLMs commoditize and race to the bottom. It reminds me of the Chinese bike wars where bike-share companies were losing massive amounts of money, but kept running sales and lowering prices in an attempt to compete and drive out their competitors. The end of that story was a bunch of major bankruptcies and giant bike graveyards.

Nvidia's hard pivot to "in the near future, everyone will run their AI at home" seems to indicate that they also see the market shifting. We've already had AI ingest everything out there. The real challenge becomes how to better optimize their algorithm to get more useful data in less space.

reply
have you seen the open source LLM space? people fulfill all niches and there are active communities at every range of RAM and all are looking for the most capable in their respective range

a lot of innovation occurring

reply
It's not economic to run them locally. It's amazing for privacy, and a fun hobby. But you're either looking at super slow CPU builds with $10k in RAM, $90k worth of GPUs, or a really quantized model that doesn't compare in quality.

I might build one for fun, but it's not going to change the economics alone. Still exciting it's possible.

reply
It depends what you’re using it for. Real time interactive Claude code session? No, it’s kind of impractical.

But if you already have agent loops dialed in (For example I have one that uses a browser testing framework), it wouldn’t really affect me at all if it crunched away at 7 tokens per second all night long.

reply
Not really, you can do great things without them. I've been summarizing hundreds of documents. I've added MCP servers to my internal business tools (Elixir apps) and can chat with the Nous Hermes agent over Telegram about pending orders, inventory level, historical product prices, etc., without having to click/dick around with a web UI.

Sure, it cannot replace SOTA models for agentic coding, except for small, well-scoped refactorings. But even a model like ministral-3:8b or qwen3.5:9b is a boon for so many smaller use cases!

reply
The hardware requirements to run this locally are still very high. Seems far enough off mainstream for those companies not to be too worried yet.
reply
Honestly, Qwen3.6 is already what you need for the large majority of tasks.

(I only ask Opus every 5 to 10 requests, when my local Qwen fails or when I encounter a situation that is too world-knowledge specific to be worth asking, but that way you can easily live with Claude's cheapest plan without ever hitting a usage limit).

reply
locally on what hardware? something like the new dgx spark, ryzen halo, or mac studio will cost you ~ $4k plus whatever you pay for power. at the rate AI is currently progressing, i think you'd be optimistic to consider that as having a 2 year depreciation.

for $4k, you can get 20 months of claude max 200. i'd take claude over the hardware.

anthropic will have something to worry about when you can run a local model on your macbook that can code. but i think we're quite a ways off from that.

reply
Yeah, 20 months of Claude Max until they rugpull you. I’m spending 7-10k/month in raw token costs on Claude Max. Having an alternative is a nice insurance policy.
reply
Just a hunch, but I think the most cost effective “local” deployment method right now is renting GPU clusters by the hour and running all the inference software on them yourself. This will be cheaper than capital expenditure on hardware that will depreciate and become last-gen, and cheaper than OpenRouter pay per token.
reply
> at the rate AI is currently progressing, i think you'd be optimistic to consider that as having a 2 year depreciation.

How so? Model capability at a fixed hardware level has been consistently (and rapidly) increasing. You might or might not be able to run state of the art 2 (or 4 or whatever) years from now but you can reasonably expect the hardware to last upwards of a decade with model performance consistently improving over that time frame.

You can get a tolerable (at least by some metrics) experience using 10 year old hardware today.

reply
You can get a 128GB Strix Halo for under $3k. Used to be under $2k. Even if you believe it'll be completely obsolete for AI in two years, it'll still be good for many other things. Games for at least several more years, a great home server and/or desktop almost indefinitely. Plus, we might actually reach good enough levels for some AI use cases, if we're not already there.

And never underestimate the potential for enshittification. Your local rig will only deliver better performance over time as more and more tweaks come out. With cloud services expect the opposite to happen as subsidies run out. It's entirely possible that they will intersect on a bang per buck basis within two years.

reply
people who can't afford Claude max 200 are using qwen 3.6 27b for local coding assistance already
reply
You forget that after 2 years you're still gonna have said Mac Studio, which can be sold off for 30-50% of the price.

Of course it's gonna lose value faster if something magical happens with hardware manufacturing, but you'll likely get 25% back at least.
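
Putting the resale value into the comparison, using the thread's own numbers (about $4k for the hardware, $200/month for the subscription, 25-50% resale). This is just the arithmetic, with made-up function names, and it leaves out power costs and any difference in model quality.

```python
def net_hardware_cost(purchase_price: float, resale_fraction: float) -> float:
    """Cost of owning the machine after selling it for a fraction of its price."""
    if not 0 <= resale_fraction <= 1:
        raise ValueError("resale_fraction must be between 0 and 1")
    return purchase_price * (1 - resale_fraction)


def equivalent_subscription_months(net_cost: float, monthly_fee: float) -> float:
    """How many months of subscription the same money would buy."""
    if monthly_fee <= 0:
        raise ValueError("monthly_fee must be positive")
    return net_cost / monthly_fee


for resale in (0.0, 0.25, 0.30, 0.50):
    net = net_hardware_cost(4000, resale)
    months = equivalent_subscription_months(net, 200)
    print(f"resale {resale:.0%}: net ${net:,.0f}, about {months:.0f} months of subscription")
```

With no resale the hardware equals the 20 months quoted upthread. At 25-50% resale it drops to somewhere between 15 and 10 months.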

On the other side, you can't really predict how valuable Claude Max is gonna be in a year, because Anthropic can further enshittify it.

reply
Why do you think they are rushing to IPO?!
reply