by pheggs20 hours ago |

by UncleOxidant20 hours ago|

If we didn't have a RAM/GPU shortage right now they would be more nervous than they are. But as it is, very few people are going to be able to afford a rig that can run this model effectively. That's probably not going to change for several more years yet. I think if the Z.ai folks decide to come out with a flash version of GLM-5.2 specialized for coding that came in at about 80B params, then the US frontier labs would probably be more worried. Overall, the Chinese AI companies have been showing the way to do the same amount with less (sometimes much less) and as that trend continues it's going to make the frontier labs worried - but even the Chinese AI companies are going to want to protect their moat by not releasing capable models that are significantly smaller than their current flagship models. Alibaba Qwen seems to be there now - it's gotten mighty quiet from them lately - their latest 395B model is just too large for most folks to run at home and they don't seem to be making any noises about releasing smaller ones this time around.

by gpm19 hours ago|

The ram/gpu shortage won't last forever though. Moreover we can be pretty confident that long-term the prices will obey Wright's law and come down significantly (from the pre-shortage prices) as we learn to produce them more efficiently.

LLM companies are valued as if they're going to have some enduring monopoly that they can extract money from... GLM-5.2 and similar models make that valuation very very questionable.
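
For reference, Wright's law is usually stated as: unit cost falls by a constant fraction each time cumulative production doubles. A minimal sketch of that relationship follows; the $100 starting cost and the 20% learning rate are illustrative assumptions only, not real RAM or GPU figures, and the function name is made up for this example.

```python
import math

def wright_unit_cost(first_unit_cost, cumulative_units, learning_rate):
    """Unit cost once `cumulative_units` have been produced.

    learning_rate is the fractional cost drop per doubling of cumulative
    output (0.20 means each doubling cuts unit cost by 20%).
    """
    # Progress exponent b satisfies 2**(-b) == 1 - learning_rate.
    b = -math.log2(1.0 - learning_rate)
    return first_unit_cost * cumulative_units ** (-b)

# Illustrative numbers only: $100 first unit, 20% learning rate.
for units in (1, 2, 4, 8):
    print(units, round(wright_unit_cost(100.0, units, 0.20), 2))
```

Whether memory and GPUs actually follow a curve like this, and at what rate, is exactly what the replies below argue about.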

by UncleOxidant19 hours ago|

> The ram/gpu shortage won't last forever though.

No disagreement there, but it could easily last another 3 to 5 years which is a long time in tech terms.

by DougN716 hours ago|

Long enough for them to IPO and all the execs to retire. I doubt they care beyond the IPO.

by r0b058 hours ago|

I think this is the play

by mannanj19 hours ago|

> The ram/gpu shortage won't last forever though

Don't underestimate the market's ability to remain irrational

by colinsane17 hours ago|

the companies which have the power to alleviate these shortages are the same companies who are profiting most from the shortage. scarcity is an asset, it's not irrational that a concentrated market will produce more of that asset.

by selectodude17 hours ago|

The solution for high prices is high prices.

If making RAM and SSDs is now cause for a 10 figure valuation, after enough time somebody will dive in.

by Tepix5 hours ago|

What's the irrational part? There's sky high demand.

by mannanj2 hours ago|

maybe the irrational part is the amount of demand for consumer hardware. wouldn't the market for hardware used for professional ML/AI move away from consumer hardware over time? (I can say more about what I mean by consumer hardware)

Also irrational parts of this market (would love to hear your thoughts):

- companies buying hardware that isn't power efficient or doesn't give an ROI for ML/AI use cases, and who would be priced out of using that hardware over time

- many people and companies are buying the hardware due to hype and scarcity/FOMO reasons over rational reasons

by elorant19 hours ago|

Very few people, but quite a lot of companies, especially after per-token pricing took effect and companies saw their invoices skyrocketing. You pay an upfront cost once and you’re done.

by bawana2 hours ago|

is it possible that AI companies ordered a bunch of RAM just so that models cannot be run locally? they are betting new fabs won't be built before quantum takes hold.

by dannyw16 hours ago|

When a large open weight model is released, a lab, startup, or a rich hobbyist can easily do logit-level distillation and create an XXb param model or whatever, and in theory you should get a really good distill.
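
Logit-level distillation here means training the small model to match the large model's full output distribution over the vocabulary, rather than only its sampled text. A rough sketch of the per-token loss, using the common temperature-softened KL divergence, might look like this. The function names and toy logits are made up for illustration, and a real pipeline would compute this in a tensor framework over whole batches.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.5]   # hypothetical teacher logits for one token
student = [2.5, 1.5, 0.2]   # hypothetical student logits for the same token
print(distill_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge.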

by verdverm19 hours ago|

I suspect the time horizon is shorter because of software advances. We are getting more capability out of smaller models

Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)

by UncleOxidant17 hours ago|

> Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)

True, Qwen3.6-27B is amazing for its size. However, it seems likely that we're not going to see any more of these smaller models from Alibaba/Qwen since several key players exited that organization a few months back.

by Infernal17 hours ago|

Do we know where those key players went?

by verdverm16 hours ago|

Good to know. I think the trend is clear based on the models coming out of China, and we'll see more capabilities in smaller, more efficient models.

by cogman1020 hours ago|

I don't think so. I could easily see a company deciding to host and run these models for their own development. If you have a dev team of about 10 people, a one time $50k investment in an LLM server has to be pretty tempting. Unlimited tokens, decent performance, upgrade options, and potential product integrations.

For companies wanting LLMs in their products in general, I have to think going the local LLM route is even more tempting. Somewhat dumb models are more than good enough for a lot of the things people are integrating LLMs into their products for.

by twelvechairs20 hours ago|

Surely for most the desire is just an LLM provider that doesn't store or sell their queries (including by national actors). As long as that is allowed to happen surely it's the answer for the vast majority.

by matheusmoreira8 hours ago|

> LLM provider that doesn't store or sell their queries

> As long as that is allowed to happen

It won't be. Only we can provide that, and only for ourselves.

by eventualcomp20 hours ago|

Where is $50k coming from again?

by stingraycharles20 hours ago|

That’s less than the monthly salary of 10 software engineers, and assuming they pay API prices, probably earns itself back in about a year.

Having said that, I don’t think it’s all that tempting for companies at all, considering this whole market is developing rapidly and it’s nearly impossible to predict where we’ll be at in a year or two.
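
To make the "earns itself back in about a year" estimate concrete, here is a small sketch of the payback arithmetic. The $50k server and the 10 developers are the thread's figures; the $5,000/month API bill and $500/month running cost are hypothetical placeholders, and the helper names are invented for this example.

```python
def break_even_months(hardware_cost, monthly_api_spend, monthly_running_cost=0.0):
    """Months until avoided API spend covers the up-front hardware cost."""
    monthly_saving = monthly_api_spend - monthly_running_cost
    if monthly_saving <= 0:
        return float("inf")  # the server never pays for itself
    return hardware_cost / monthly_saving

def required_spend_per_dev(hardware_cost, devs, payback_months):
    """API spend per developer per month needed to hit a target payback."""
    return hardware_cost / payback_months / devs

# $50k server, 10 developers, 12-month payback ("about a year").
print(round(required_spend_per_dev(50_000, 10, 12), 2))
# Hypothetical: $5,000/month in API bills, $500/month for power and upkeep.
print(round(break_even_months(50_000, 5_000, 500), 1))
```

Under these assumptions a one-year payback needs each developer to be burning a bit over $400 a month in API tokens; lighter usage stretches the payback out accordingly.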

by cogman1020 hours ago|

The hardware requirements aren't evolving and the local models have only been improving.

It's not like you'd lose capabilities, if anything this solution just gets better with time.

by chatmasta19 hours ago|

If the newer models require more/better hardware then you’ll lose capabilities.

I think you’re better off renting GPU instances and running all the software on those. It’ll be cheaper than Anthropic and OpenRouter but slightly more expensive than electricity and depreciation of hardware.

by cogman1018 hours ago|

The newer models don't require more/better hardware. There's a small army of local llm enthusiasts who are running LLMs using 3090s and H100s because they have lots of memory. Them being old isn't really that big of an issue as the compute power needed is relatively low all things considered.

The number of parameters needed for these open weight models has mostly stabilized so the actual memory requirements aren't likely to change all that much.

by dannyw16 hours ago|

Correct. The main bottleneck with LLM inference is, and has always been, memory bandwidth.

TPS = your memory bandwidth in GB/s / active weights in GB.

That’s it for decode. That’s all.
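
As a back-of-the-envelope check on the memory-bandwidth bound: each decoded token has to stream the active weights through memory once, so bandwidth divided by active weight size gives an upper limit on tokens per second. The sketch below assumes a hypothetical model with 30B active parameters at 4-bit quantization and a hypothetical 1,000 GB/s of memory bandwidth. It ignores KV-cache reads, batching, and compute limits, so real throughput will be lower.

```python
def active_weights_gb(active_params_billion, bits_per_weight):
    """Size of the weights read per token, in GB."""
    return active_params_billion * bits_per_weight / 8

def decode_tps_upper_bound(bandwidth_gb_per_s, weights_gb):
    """Upper bound on single-stream decode tokens per second."""
    return bandwidth_gb_per_s / weights_gb

weights = active_weights_gb(30, 4)             # 15.0 GB
print(decode_tps_upper_bound(1000, weights))   # about 66.7 tokens/s
```

Halving the active weights (a smaller expert set or a tighter quantization) doubles the bound, which is why mixture-of-experts models are attractive on bandwidth-limited hardware.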

by Tepix5 hours ago|

$50K seems low if you want to run, say, GLM 5.2 4bit fast enough for a team of devs.

You need something like 6x RTX Pro 6000 at $11800 each plus a nice server (add $10000) = $80800 and then quite a bit of electricity.
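
Spelling that estimate out, with a two-card build as a purely hypothetical comparison point at the same per-card and server prices (electricity excluded in both cases):

```python
def build_cost(gpu_count, gpu_price, server_price):
    """Up-front hardware cost in dollars, excluding electricity."""
    return gpu_count * gpu_price + server_price

print(build_cost(6, 11_800, 10_000))  # 80800, the figure quoted above
print(build_cost(2, 11_800, 10_000))  # 33600, a hypothetical two-card build
```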

by theYipster1 hours ago|

You don't need all of the model in VRAM. 1 or 2 RTX Pro 6000s will do. $50K will get you there very nicely, and on a 1600 watt PSU if you go for the MAX-Q versions. (The same wattage PSU I'm typing this on, and have been using over the last 5 years.)

by cogman1020 hours ago|

As in who pays for it or how did I arrive at that number?

For who pays for it, obviously the employer would.

For "how did I arrive at this number": it's a ballpark estimate from what I know about part costs. Most of that money will go towards AI cards: about $5k for the motherboard, CPU, power supply, etc., and $45k for as much RAM and as big/expensive Nvidia cards as you can get your hands on. The B300 has 288GB of VRAM in it, which is probably what you'd be after.

by simplyluke18 hours ago|

You don't even need to run them locally for them to be a threat. Plenty of companies are looking at paying third party companies to host these models and they come in at fractions of the price of the frontier labs.

by fny20 hours ago|

The RAM requirements are still pretty painful.

by yieldcrv20 hours ago|

equilibrium in one or two more years on the consumer/prosumer side

think Apple M6 or M7 with a currently unforeseen denser memory style, 256gb RAM

a couple inference or cache improvements on the algorithmic side, using less ram for context windows and doubling token speed again

denser open source models, packing more experts for smaller active layers

it'll still be expensive but like $8,000 - $13,000 instead of $450,000 worth of B200s

by stingraycharles20 hours ago|

Fairly certain that model sizes and computational requirements will grow as the price for LLM compute drops.

by 3stacks19 hours ago|

Maybe there's a conversation to be had about how much is enough... Unless something beyond my imagination happened, I would be happy enough with Opus 4.5 levels of productivity

by stingraycharles10 hours ago|

This really sounds like “640kb should be enough”.

I’m sorry, but I just can’t imagine us running smaller models than we are using right now 5-10 years from now.

by hajile3 hours ago|

We've already hit RAM power and size limits (about 40k electrons which is the limit before we get noise messing up the amplifier).

If a model needs 2x more memory, but serves the same number of customers, the cost is going to go up per customer to cover the increased hardware and power costs. Companies are starting to implement AI limits to keep costs under control.

Anthropic and OpenAI are rumored to be considering cutting inference prices trying to retain customers as LLMs commoditize and race to the bottom. It reminds me of the Chinese bike wars where bike-share companies were losing massive amounts of money, but kept running sales and lowering prices in an attempt to compete and drive out their competitors. The end of that story was a bunch of major bankruptcies and giant bike graveyards.

Nvidia's hard pivot to "in the near future, everyone will run their AI at home" seems to indicate that they also see the market shifting. We've already had AI ingest everything out there. The real challenge becomes how to better optimize their algorithm to get more useful data in less space.

by yieldcrv19 hours ago|

have you seen the open source LLM space? people fulfill all niches and there are active communities at every range of RAM and all are looking for the most capable in their respective range

a lot of innovation occurring

by scosman17 hours ago|

It's not economic to run them locally. It's amazing for privacy, and a fun hobby. But you're either looking at super slow CPU builds with $10k in RAM, $90k worth of GPUs, or a really quantized model that doesn't compare in quality.

I might build one for fun, but it's not going to change the economics alone. Still exciting it's possible.

by oceanplexian4 hours ago|

It depends what you’re using it for. Real time interactive Claude Code session? No, it’s kind of impractical.

But if you already have agent loops dialed in (For example I have one that uses a browser testing framework), it wouldn’t really affect me at all if it crunched away at 7 tokens per second all night long.

by leansensei3 hours ago|

Not really, you can do great things without them. I've been summarizing hundreds of documents. I've added MCP servers to my internal business tools (Elixir apps) and can chat with the Nous Hermes agent over Telegram about pending orders, inventory level, historical product prices, etc., without having to click/dick around with a web UI.

Sure, it cannot replace SOTA models for agentic coding, except for small, well-scoped refactorings. But even a model like ministral-3:8b or qwen3.5:9b is a boon for so many smaller use cases!

by CamouflagedKiwi20 hours ago|

The hardware requirements to run this locally are still very high. Seems far enough off mainstream for those companies not to be too worried yet.

by stymaar18 hours ago|

Honestly, Qwen3.6 is already what you need for the large majority of tasks.

(I only ask Opus every 5 to 10 requests, when my local Qwen fails or when I encounter a situation that is too world-knowledge specific to be worth asking it, but that way you can live easily with Claude's cheapest plan without ever hitting the usage limit).

by notatoad19 hours ago|

locally on what hardware? something like the new dgx spark, ryzen halo, or mac studio will cost you ~ $4k plus whatever you pay for power. at the rate AI is currently progressing, i think you'd be optimistic to consider that as having a 2 year depreciation.

for $4k, you can get 20 months of claude max 200. i'd take claude over the hardware.

anthropic will have something to worry about when you can run a local model on your macbook that can code. but i think we're quite a ways off from that.
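
The comparison above can be written out as a quick sketch. The $4k hardware price and the $200/month plan are the figures from the comment; the 30% resale value, the $20/month of electricity, and the 24-month horizon are hypothetical inputs added for illustration, as are the function names.

```python
def months_of_subscription(hardware_price, monthly_fee):
    """How many months of a subscription the hardware budget would buy."""
    return hardware_price / monthly_fee

def net_hardware_cost(hardware_price, resale_fraction, monthly_power_cost, months):
    """Hardware cost over `months`, net of resale value, plus power."""
    return hardware_price * (1 - resale_fraction) + monthly_power_cost * months

print(months_of_subscription(4_000, 200))          # 20.0
# Hypothetical: 30% resale value and $20/month of electricity over 24 months.
print(net_hardware_cost(4_000, 0.30, 20, 24))
```

How the two options compare depends almost entirely on the resale and usage assumptions, and the sketch says nothing about the quality gap between a local model and the hosted one.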

by oceanplexian4 hours ago|

Yeah, 20 months of Claude Max until they rugpull you. I’m spending 7-10k/month in raw token costs on Claude Max. Having an alternative is a nice insurance policy.

by chatmasta19 hours ago|

Just a hunch, but I think the most cost effective “local” deployment method right now is renting GPU clusters by the hour and running all the inference software on them yourself. This will be cheaper than capital expenditure on hardware that will depreciate and become last-gen, and cheaper than OpenRouter pay per token.

by fc417fc80214 hours ago|

> at the rate AI is currently progressing, i think you'd be optimistic to consider that as having a 2 year depreciation.

How so? Model capability at a fixed hardware level has been consistently (and rapidly) increasing. You might or might not be able to run state of the art 2 (or 4 or whatever) years from now but you can reasonably expect the hardware to last upwards of a decade with model performance consistently improving over that time frame.

You can get a tolerable (at least by some metrics) experience using 10 year old hardware today.

by c7b13 hours ago|

You can get a 128GB Strix Halo for under $3k. Used to be under $2k. Even if you believe it'll be completely obsolete for AI in two years, it'll still be good for many other things. Games for at least several more years, a great home server and/or desktop almost indefinitely. Plus, we might actually reach good enough levels for some AI use cases, if we're not already there.

And never underestimate the potential for enshittification. Your local rig will only deliver better performance over time as more and more tweaks come out. With cloud services expect the opposite to happen as subsidies run out. It's entirely possible that they will intersect on a bang per buck basis within two years.

by tomr7519 hours ago|

people who can't afford Claude max 200 are using qwen 3.6 27b for local coding assistance already

by SXX11 hours ago|

You forget that after 2 years you're still gonna have said Mac Studio that can be sold off for 30-50% of the price.

Of course it's gonna lose value faster if something magical happens with hardware manufacturing, but you'll likely get 25% back at least.

On the other side you can't really predict how valuable Claude Max is gonna be in a year because Anthropic can further enshittify it.

by fsuts15 hours ago|

Why do you think they are rushing to IPO?!
