[-]

You got it wrong. Inference can use crap GPUs. Training needs the 100x more expensive big guns. Our training machine is 100x more expensive than our inference machine.

by bombcar2 hours ago|

[-]

How is the result of training stored? How big is that? It seems reasonable to assume we’ll eventually plateau and all we’ll need is relatively infrequent training.

by rurban36 minutes ago|

[-]

Not so often. The GPUs are running at 100% for 3 weeks for a training run. We do images only, but it's the same process. And then we can use the costly GPUs for inference: local-model coding agents. Training is about 4x a year. But it depends on what ideas the PM or the customers have. If they have more, there are more training tasks, e.g. more viruses to detect.

by brandensilva1 hours ago|

[-]

I agree, leave the training to open source federations that roll out like operating systems. Minimal training over time.

Then have inference go down to the next layer to use those models as a P2P decentralized network.

Maybe something like OpenRouter could tap federation networks.

by sho12 hours ago|

[-]

> AI hardware is for inference, not training

Not sure what you are referring to, unless you don't think H100/H200/B200 are "AI hardware"

> Superpods aren't really power efficient

Maybe not compared to a specialized rig with multiple 4090s, but that is the best case for consumer hardware - the vast majority will be dramatically less efficient than that

Anyway, I agree the interconnect is by far the biggest obstacle and seems insurmountable, I should probably have led with that.

by pksebben13 hours ago|

[-]

Bit of a doozy though, that one.

I recall getting really excited over Hinton's FF (Forward-Forward) foray, right before he bailed on AI as a societal direction (which, if anyone ever had the right, I suppose he does). If one squints, one can see a backprop-free base being much easier to train on geographically distributed and heterogeneous hardware.

by Davidzheng9 hours ago|

[-]

Are you sure most of frontier cost isn't inference in RL environments?

by dyauspitr12 hours ago|

[-]

That makes no sense. It’s basically the same calculations for training as well.

by iugtmkbdfil8341 hours ago|

[-]

Dunno, in a sense, torrents came about under similar restrictions. Everything at consumer level was just plain awful and at dial-up level, mebbe ISDN if you were very lucky, with fiber only available to ridiculously rich people and corps. But with restrictions came approaches to mitigate them.

[-]

Yes but not violations of the laws of physics. You need extremely fast communications, memory bandwidth, etc; you cannot get that with distributed training. You're up against the speed of light and the interconnect that powers the internet. You will always have horrifically slow latency compared to if you pack the servers together in the same place with specialized networking.
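To put rough numbers on the speed-of-light point, here is a minimal sketch. It assumes signals in optical fiber travel at about two thirds of c, and the two distances (10 m inside a datacenter, 4,000 km between homes) are hypothetical values picked for illustration, not figures from the thread. It only computes a propagation lower bound and ignores routing, switching, and queuing, which make real internet latency worse.

```python
# Lower bound on round-trip time from propagation delay alone.
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact by definition
FIBER_FRACTION = 2 / 3                # assumption: light in fiber moves at roughly 2/3 c


def round_trip_seconds(distance_m):
    """Time for a signal to go distance_m and come back, propagation only."""
    return 2 * distance_m / (SPEED_OF_LIGHT_M_PER_S * FIBER_FRACTION)


in_datacenter = round_trip_seconds(10)            # hypothetical 10 m cable run
cross_continent = round_trip_seconds(4_000_000)   # hypothetical 4,000 km path

print(f"in datacenter:   {in_datacenter * 1e9:.0f} ns")
print(f"cross continent: {cross_continent * 1e3:.1f} ms")
print(f"ratio:           {cross_continent / in_datacenter:,.0f}x")
```

Under these assumptions the gap is purely a function of distance, so no protocol improvement can close it; only algorithms that need fewer round trips can work around it.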

by iugtmkbdfil83423 minutes ago|

[-]

<< You will always have horrifically slow latency compared to if you pack the servers together in the same place with specialized networking.

Agree about the physics; disagree about the larger point.

I am not questioning that servers packed together may achieve an optimal result in how we are currently doing things, but, and this is my point, what if we didn't?

<< you cannot get that with distributed training

This is entirely the wrong question to ask. The question to ask is: how could it be adapted to distributed training?

by boutell1 hours ago|

[-]

If weights can't be looked at almost instantly in bulk, it just doesn't work. It's a different problem from distributing file downloads.

by iugtmkbdfil83427 minutes ago|

[-]

I used it as an example. I understand the problem is hard. My larger point was that this is exactly how actual progress tends to take place. Well, that and porn.

by c7b7 hours ago|

[-]

Could you put some numbers and examples behind the efficiency gap between data center and consumer-grade AI hardware? Did you include examples like the RTX Spark on the consumer side? I was always amazed at the low power consumption of unified memory style architectures. In absolute terms and even more so compared to consumer-grade GPUs. I'd be genuinely interested in a comparison with data-center-grade hardware.

[-]

It's more than the raw hardware, it's the interconnect and communication between the hardware at scale. These models are trained on hundreds of thousands of GPUs today. You _will_ start to see cross-datacenter training runs, but these need to efficiently decide when and how to communicate across datacenters, which bears a very high cost compared to intra-datacenter communication.

by zozbot2346 hours ago|

[-]

DGX Spark is effectively prosumer hardware, better than most consumer stuff but still not comparable to actual datacenter gear. You can't just look at TDP in isolation without also comparing performance.

by CuriouslyC6 hours ago|

[-]

> It would be better for governments to buy and own their own datacenters, maybe as a coalition, and dedicate their operation to the public good. I believe that is what we actually have to do.

100% agree. The US government basically has to nationalize AI and capture an outsize portion of the revenue from it in order to fix the economy, as the combination of debt burden and interest rate pressure from de-dollarization/global realignment is going to push us into a death spiral, and even if AI is a smash hit, the ~19% federal capture of corporate revenue isn't nearly enough to pull us out of it. The people owning the compute infrastructure and capturing more profit from AI at that layer is the safest, cleanest way to increase revenue capture. A sovereign wealth fund is a mediocre idea because it's possible to play shell games with stocks and redirect profit/debt (venture capital is quite good at this!).

[-]

> The US government basically has to nationalize AI and capture an outsize portion of the revenue from it in order to fix the economy, as the combination of debt burden and interest rate pressure from de-dollarization/global realignment is going to push us into a death spiral, and even if AI is a smash hit, the ~19% federal capture of corporate revenue isn't nearly enough to pull us out of it.

Any actual numbers to back this up? I don't see how nationalizing a very cutting-edge technology outside of wartime is going to go super well. The leverage that these companies have is the same leverage that TSMC has: you can't just take over and expect things to rocket at the pace they're going.

by root-parent6 hours ago|

[-]

>> The US government basically has to nationalize AI and capture an outsize portion of the revenue from it

Currently AI has generated no profit. And as it sits, it is a non-viable business.

I refuse to include the sellers of shovels as AI revenue.

If the companies buying the shovels are still losing money, then the tool suppliers' fortunes have nothing to do with the economics of the AI application layer, which is losing money on every prompt.

[-]

It's the most naive opinion that keeps getting shoveled around. You have a product that is viewed as essential by businesses, with revenue growing by 10x a year and geopolitical ramifications that have continued to rear their heads, and your opinion is "this is all an unprofitable shill". It is extraordinary to me that people really believe this. Whether or not labs run at a loss today is absolutely irrelevant. There are of course steady-state economics that make sense, and it's not well known what the profitability picture is right now, so to say "Currently AI has generated no profit" is also just speculation, and not very insightful speculation at that.

by CuriouslyC5 hours ago|

[-]

I've heard that the API calls by themselves are ~60% profit if you ignore capital expenditures. The labs haven't generated profit because they're constantly sinking money into the next generation of larger models to stay relevant. Dario has talked about the economics of this a lot, and I do believe him there.

There's clearly also a lot of pent up demand in the corporate world for inference, the problem is that it's currently expensive enough that enterprises are balking at the cost before they've had a chance to refine processes and see projects through to fruition. That's a tractable problem to solve though.

by bombcar2 hours ago|

[-]

The number of capital-heavy businesses that are wildly profitable “if you ignore capital expenses” is too many to list.

Airlines, for example, which are so profitable they continually go bankrupt.

by CuriouslyC2 hours ago|

[-]

That's true, but if the frontier doesn't advance there's no depreciation or ongoing capital expenditure. If all the frontier labs agreed to stop making stronger AI and just try to sell what they've already trained today, their books would turn green in a hurry.

by WithinReason11 hours ago|

[-]

The efficiency difference between training on GPUs and TPUs is 2x at best. You can get very efficient with tensor cores, converging to TPU efficiency. In the end math is math, you can't make a multiplication more efficient than it already is on a GPU.

by schobi11 hours ago|

[-]

I guess this was more related to syncing GPUs.

If you were to take 500 computers with older 1080 GPUs, you might have enough compute/ram equivalent to an H200 GPU for training such a model. Maybe take 10000.

But if those machines are spread over 10000 homes, wired with residential internet service, training a large model will not get anywhere.

You go from "data in the same HBM memory chip" at 4.8 TB/s or "data in adjacent GPU" with NVLink at 1.2 TB/s down to 25 Mbit/s upload speed. Accessing the next piece of data is going to be about a million times slower. At the same time you will heat a thousand times more, for a million times longer.
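The "about a million times slower" estimate can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch using only the three bandwidth figures quoted above; the 1 GB payload is a hypothetical size chosen for illustration, and real transfers would also pay latency and protocol overhead.

```python
# Bandwidth figures as quoted in the comment, converted to bytes per second.
HBM_BYTES_PER_S = 4.8e12             # 4.8 TB/s, same-package HBM
NVLINK_BYTES_PER_S = 1.2e12          # 1.2 TB/s, adjacent GPU over NVLink
HOME_UPLOAD_BYTES_PER_S = 25e6 / 8   # 25 Mbit/s residential upload

# How many times slower the home uplink is than each local link.
hbm_vs_home = HBM_BYTES_PER_S / HOME_UPLOAD_BYTES_PER_S
nvlink_vs_home = NVLINK_BYTES_PER_S / HOME_UPLOAD_BYTES_PER_S

# Hypothetical payload: 1 GB of data that has to leave the machine.
PAYLOAD_BYTES = 1e9
seconds_over_home = PAYLOAD_BYTES / HOME_UPLOAD_BYTES_PER_S
seconds_over_nvlink = PAYLOAD_BYTES / NVLINK_BYTES_PER_S

print(f"HBM vs home uplink:    {hbm_vs_home:,.0f}x")
print(f"NVLink vs home uplink: {nvlink_vs_home:,.0f}x")
print(f"1 GB over home uplink: {seconds_over_home:.0f} s")
print(f"1 GB over NVLink:      {seconds_over_nvlink * 1e3:.2f} ms")
```

With these inputs the ratio against HBM comes out at roughly 1.5 million, which matches the order of magnitude claimed.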

by incrudible10 hours ago|

[-]

You need to train independently and merge rarely. The problem is the merge step. Weights are too entangled, you are not going to get an improvement commensurate to the effort. Otherwise, everyone would do it. It is an open research problem.
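For readers unfamiliar with the "train independently, merge rarely" pattern, here is a toy sketch of it in the form of local SGD with periodic parameter averaging. Everything in it is an illustrative assumption: the one-parameter linear model, the shard sizes, the learning rate, and the function names are hypothetical. On a convex toy problem like this, plain averaging merges cleanly; the point above is that for large neural networks the merge step is the hard, unsolved part, and this sketch does not address that.

```python
import random


def make_shard(rng, n, true_w):
    """Private data for one worker: y = true_w * x plus a little noise."""
    shard = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        shard.append((x, true_w * x + rng.gauss(0.0, 0.1)))
    return shard


def local_steps(w, shard, steps, lr, rng):
    """Run SGD on one worker's shard with no communication at all."""
    for _ in range(steps):
        x, y = rng.choice(shard)
        grad = 2 * (w * x - y) * x   # gradient of the squared error (w*x - y)^2
        w -= lr * grad
    return w


def train_and_merge(num_workers=4, rounds=20, local=50, lr=0.1, true_w=3.0, seed=0):
    rng = random.Random(seed)
    shards = [make_shard(rng, 200, true_w) for _ in range(num_workers)]
    w = 0.0
    for _ in range(rounds):
        # Each worker starts from the shared weight and trains alone.
        local_ws = [local_steps(w, shard, local, lr, rng) for shard in shards]
        # The merge: one number per worker is the only thing communicated.
        w = sum(local_ws) / num_workers
    return w


merged = train_and_merge()
print(f"merged weight: {merged:.3f} (target 3.0)")
```

Here each worker communicates once per 50 local steps, which is the kind of ratio that would make slow links tolerable if merging worked equally well for real models.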

by filup9 hours ago|

[-]

That sounds like the way. Everyone trains their own small problems to maximally compressed weights and then merges.

by zozbot23411 hours ago|

[-]

The power-constrained part of compute is data movement, not the elementary arithmetic per se. Anyway, it's very possible to tweak the underlying design to increase throughput a lot for any given power budget at the cost of high latency. This seems especially useful for training workloads where we don't really care about latency as much.

by GeoAtreides6 hours ago|

[-]

Math is math, but sadly math isn't physics nor engineering.

by pvirgiliu3 hours ago|

[-]

math has physics.

by 2 hours ago|

[-]

deleted

by herewulf6 hours ago|

[-]

WRT government data centers, there is certainly precedent for independent researchers getting HPC time on systems owned by US national labs, research institutions, universities, and then publishing their results as part of the public good.

One would question why this hasn't already become the rule, as opposed to the proliferation of private data centers. However, I am sure the answers are plain and perhaps saddening to us all.

by Cider998612 hours ago|

[-]

What makes you think Deepseek or GLM won't catch up to Fable level? Why would there be a break in the trend now?

by zozbot23410 hours ago|

[-]

DeepSeek and GLM (plus Kimi) are at or above Sonnet level wrt. favorable workloads like coding. They're not close to Opus or the latest GPT yet, and Fable is even higher than that. Other workloads relying more on real-world knowledge have them even further behind, and this can't be mitigated without making the model itself bigger and harder to host locally.

by CuriouslyC6 hours ago|

[-]

Not true. Big models buy you baked in knowledge and long context cohesion. A model can be trained to use search and knowledge base tools more efficiently to mitigate the former, and harnesses/workflows can be designed to push models into small parallel threads to mitigate the latter.

The thing that big models will always bring to the table is the ability to YOLO weak/under-specified prompts, and spend less time in the loop making sure work gets partitioned correctly. For smaller/simpler tasks the P(success) difference isn't that big.

by zozbot2343 hours ago|

[-]

Knowledge-base access is not very useful in general because a model doesn't have well-defined "known unknowns" that might trigger an agentic search of the outside knowledge base. Plus surfacing knowledge you don't know much about is itself hard.

by dboreham5 hours ago|

[-]

These things sound plausible, but have they actually been demonstrated? Wouldn't anyone who succeeded in making such a small but useful LLM be raking in the money now?

by CuriouslyC4 hours ago|

[-]

Cursor's Composer 2.5 is a perfect example. It's right on the heels of the frontier (for coding only) while being an order of magnitude cheaper. As much as I've shit on Cursor in the past, I do think the company is well positioned to pick up people getting sticker shock on Anthropic tokens, if they can get their marketing down.

by zozbot2343 hours ago|

[-]

If that's Kimi-based it would very much be on the larger side of open-weight models (1T params).

by CuriouslyC2 hours ago|

[-]

It is, but the US labs have been pushing parameters heavily. There was a pullback from big models after GPT4.5 in particular, but with a shift towards emphasis on post-training and the good results Google got with scaling Gemini 3, all the labs started to push scaling again, which is the reason the frontier is getting more expensive. So that 1T isn't as big as it sounds; the American frontier is probably sitting at 3-5T at least.

by thepasch10 hours ago|

[-]

> They're not close to Opus or the latest GPT yet

Disagreed. GLM-5.1 is easily as good as Opus 4.5 for all the coding purposes I could throw at it, which is the model that kicked this entire hype cycle into overdrive in the first place.

by Cider998610 hours ago|

[-]

I've found GLM to be comparable to or better than Opus at writing, and at a fraction of the cost.

by zozbot23410 hours ago|

[-]

Writing does not rely on real-world knowledge all that much, other than knowledge of language itself. Even tiny models can achieve that, it's even easier than coding.

by CuriouslyC6 hours ago|

[-]

The challenge with writing is the lab collapsing the distribution around "tasteful" writing, when the people making decisions about training data aren't able to effectively discriminate it.

by metalspot8 hours ago|

[-]

The key thing here is that effective intelligence = model capability / cost. If you drive down the cost of inference you can have higher effective capability even with a technically less capable model. There is nothing in Anthropic's/OpenAI's general reasoning capabilities that can't be easily done much better with a purpose-built harness for a domain-specific task.

by kuboble11 hours ago|

[-]

I think there are at least a few question marks.

One being that extrapolating from like 3 data points is hardly science. All trends break at some point.

The other is that the measures to prevent distillation of their models (if it was a secret sauce of Chinese models) could work if nobody is allowed to use them.

by KaiserPro2 hours ago|

[-]

> It would be better for governments to buy and own their own datacenters,

I mean that's good, but they'd also have to build their own dataset. Which involves either paying people, or breaking the law.

Plus if they do manage to make it work, they will not get any tax revenue from it, as it'll remove the need for labour, which is where a huge amount of tax revenue comes from.

It's a deeply hard problem with lots of second/third order effects.

by 11 hours ago|

[-]

deleted

by incrudible10 hours ago|