The far, FAR superior power efficiency means that even if you did harness every public GPU or GPU-like device on earth, you'd end up consuming so much excess electricity it would be cheaper on net to simply take the money that would have gone to the power bill and spend it on your own datacenter.
And even if electricity were free, having those GPUs spread over the world with internet-level latency will slow everything down by factors of thousands to millions - if it's feasible at all. Regardless, you're not getting fable-oss this decade, maybe not even this century.
It would be better for governments to buy and own their own datacenters, maybe as a coalition, and dedicate their operation to the public good. I believe that is what we actually have to do.
> you'd end up consuming so much excess electricity it would be cheaper on net to simply take the money that would have gone to the power bill and spend it on your own datacenter.
With costs spread over a large population, it really doesn't matter. You're not getting hundreds of thousands of people to pitch in half their monthly electric bill to pay for someone else's datacenter. They will pay for the electricity themselves quite happily though, if all they need to do is give you compute. This isn't new.
Interconnect is the bottleneck for distributed training, nothing else really.
Then have inference go down to the next layer to use those models as a P2P decentralized network.
Maybe something like OpenRouter could tap federated networks.
Not sure what you are referring to, unless you don't think H100/H200/B200 are "AI hardware".
> Superpods aren't really power efficient
Maybe not compared to a specialized rig with multiple 4090s, but that is the best case for consumer hardware - the vast majority will be dramatically less efficient than that.
Anyway, I agree the interconnect is by far the biggest obstacle and seems insurmountable, I should probably have led with that.
I recall getting really excited over Hinton's FF foray, right before he bailed on AI as a societal direction (which, if anyone ever had the right, I suppose he does). If one squints, one can see a backprop-free base being much easier to train on geographically distributed and heterogeneous hardware.
Agree about the physics; disagree about the larger point.
I am not questioning that servers packed together may achieve an optimal result in how we are currently doing things, but, and this is my point, what if we didn't?
> you cannot get that with distributed training
This is entirely the wrong question to ask. The question to ask is: how could it be adapted to distributed training?
100% agree. The US government basically has to nationalize AI and capture an outsize portion of the revenue from it in order to fix the economy. The combination of debt burden and interest rate pressure from de-dollarization/global realignment is going to push us into a death spiral, and even if AI is a smash hit, the ~19% federal capture of corporate revenue isn't nearly enough to pull us out of it. The people owning the compute infrastructure and capturing more profit from AI at that layer is the safest, cleanest way to increase revenue capture. A sovereign wealth fund is a mediocre idea because it's possible to play shell games with stocks and redirect profit/debt (venture capital is quite good at this!).
Any actual numbers to back this up? I don't see how nationalizing a very cutting-edge technology outside of wartime is going to go super well. The leverage that these companies have is the same leverage that TSMC has: you can't just take over and expect things to rocket along at the pace it's going.
Currently AI has generated no profit. As it sits, it is a non-viable business.
I refuse to include the sellers of shovels as AI revenue.
If the companies buying the shovels are still losing money, then the tool suppliers' fortunes have nothing to do with the economics of the AI application layer, which is losing money on every prompt.
There's clearly also a lot of pent-up demand in the corporate world for inference; the problem is that it's currently expensive enough that enterprises are balking at the cost before they've had a chance to refine processes and see projects through to fruition. That's a tractable problem to solve though.
Airlines, for example, which are so profitable they continually go bankrupt.
If you were to take 500 computers with older 1080 GPUs, you might have enough compute/RAM equivalent to an H200 GPU for training such a model. Maybe take 10000.
But if those machines are spread over 10000 homes, wired with residential internet service, training a large model will not get anywhere.
You go from "data in the same HBM memory chip" at 4.8 TB/s or "data in adjacent GPU" with NVLink at 1.2 TB/s down to 25 Mbit/s upload speed. Accessing the next piece of data is going to be about a million times slower. At the same time you will produce a thousand times more heat, for a million times longer.
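For anyone who wants to check the "about a million times slower" figure, here is a quick back-of-envelope sketch using only the bandwidth numbers quoted above. It compares raw bandwidth only and ignores latency, protocol overhead and anything else that would make the residential link look worse, so treat it as a lower bound on the gap rather than a measurement.

```python
# Back-of-envelope comparison of the bandwidth figures quoted in the comment.
HBM_BYTES_PER_S = 4.8e12      # 4.8 TB/s, as quoted
NVLINK_BYTES_PER_S = 1.2e12   # 1.2 TB/s, as quoted
UPLINK_BITS_PER_S = 25e6      # 25 Mbit/s residential upload, as quoted

# Convert the uplink from bits to bytes so the units match.
uplink_bytes_per_s = UPLINK_BITS_PER_S / 8


def slowdown(fast_bytes_per_s, slow_bytes_per_s):
    """How many times slower the slow link is than the fast one."""
    return fast_bytes_per_s / slow_bytes_per_s


hbm_vs_uplink = slowdown(HBM_BYTES_PER_S, uplink_bytes_per_s)
nvlink_vs_uplink = slowdown(NVLINK_BYTES_PER_S, uplink_bytes_per_s)

print(f"HBM vs home uplink:    {hbm_vs_uplink:,.0f}x")
print(f"NVLink vs home uplink: {nvlink_vs_uplink:,.0f}x")
```

So roughly 1.5 million times for HBM and a few hundred thousand times for NVLink, which lines up with the "about a million" claim.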
One would question why this hasn't already happened as the rule, as opposed to the proliferation of private data centers. However, I am sure the answers are plain and perhaps saddening to us all.
The thing that big models will always bring to the table is the ability to YOLO weak/under-specified prompts, and spend less time in the loop making sure work gets partitioned correctly. For smaller/simpler tasks the P(success) difference isn't that big.
Disagreed. GLM-5.1 is easily as good as Opus 4.5 for all the coding purposes I could throw at it, which is the model that kicked this entire hype cycle into overdrive in the first place.
One being that extrapolating from like 3 data points is hardly science. All trends break at some point.
The other is that the measures to prevent distillation of their models (if it was a secret sauce of Chinese models) could work if nobody is allowed to use them.
I mean that's good, but they'd also have to build their own dataset, which involves either paying people or breaking the law.
Plus if they do manage to make it work, they will not get any tax revenue from it, as it'll remove the need for labour, which is where a huge amount of tax revenues come from.
It's a deeply hard problem with lots of second- and third-order effects.
The first part is not really true though. The chips are not that much faster, the DRAM is not that much faster, and in aggregate it does not matter because there is just so much more consumer hardware out there (although perhaps that is changing as supply shifts toward datacenters).
The interconnect and data locality are the problem. If you could train it the way you can render a scene with Monte Carlo ray tracing, any result from any node could be merged with any other and the combined result would converge closer to the limit. I am sure research in that direction exists; it just has not proven effective at the scales where it has been attempted.
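To make the Monte Carlo comparison concrete, here is a minimal sketch of the "any partial result merges with any other" property. It uses a toy estimate of pi instead of ray tracing, and the node sizes and seeds are made up for illustration. Each node only ships back two integers, and the merge does not care about order or about how many samples each node contributed.

```python
import random


def node_estimate(n_samples, seed):
    """One node's independent work: count random points inside the unit quarter circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits, n_samples


def merge(partials):
    """Combine partial results from any nodes, in any order, into one estimate of pi."""
    total_hits = sum(hits for hits, _ in partials)
    total_samples = sum(n for _, n in partials)
    return 4.0 * total_hits / total_samples


# Three hypothetical nodes doing different amounts of work.
partials = [
    node_estimate(20_000, seed=1),
    node_estimate(50_000, seed=2),
    node_estimate(30_000, seed=3),
]

pi_merged = merge(partials)
print(f"merged estimate from {len(partials)} nodes: {pi_merged:.4f}")
```

Gradient-based training does not have this property out of the box, because each step depends on the weights produced by the previous steps, so partial results from nodes that started at different points are not simply additive.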
Models have a limited shelf life while things are improving rapidly, and decentralized training is just more wasteful.
However, things might change if we get to what Karpathy calls "cognitive core" - a stable model backbone which can be extended via skills/adapters/etc. Development of extensions to the core can be a lot more decentralized.
But for now these decentralized training attempts function largely as a deterrent to anti-open-source collusion.
That just isn't true. It misunderstands exactly how much silicon has gone directly to those companies, and exactly how much more powerful said silicon is compared to consumer grade gear.
Very rough math like I said but I doubt it's directionally wrong.
And even if you did force literally everyone on earth with some sort of GPU to max it out 24/7 in service of an open source AI training enterprise - you would waste so much power trying to use that inefficient consumer hardware with the worst latency imaginable that it would be cheaper and faster to get everyone to chip in some cash to buy a datacenter with Blackwell chips instead! So the idea has no legs whatsoever.
It's pretty useless to compare raw FLOPS, but as a general hand-waving guesstimate, F@H is currently doing about 25 petaflops in a mix of FP16 and FP32. AI usually trains at FP8, but to keep things fair the H100 is quoted at 60 FP64 teraflops per unit, so that's 12 FP64 exaflops given a 200k count.
So F@H at its peak did 2.43 exaflops@FP16/32. Colossus 1 does 12@FP64. These numbers are very hand-wavy, but I think the point is made.
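The arithmetic behind those figures, as a quick sketch. Every input is a number quoted in this comment (60 FP64 teraflops per H100, a 200k unit count, 25 petaflops current and 2.43 exaflops peak for F@H); none of them are independently verified here, and the precisions differ, so the ratios are rough at best.

```python
TERA, PETA, EXA = 1e12, 1e15, 1e18

# Figures as quoted in the comment.
h100_fp64_flops = 60 * TERA
h100_count = 200_000
fah_current_flops = 25 * PETA   # mixed FP16/FP32
fah_peak_flops = 2.43 * EXA     # mixed FP16/FP32

cluster_flops = h100_fp64_flops * h100_count

ratio_vs_current = cluster_flops / fah_current_flops
ratio_vs_peak = cluster_flops / fah_peak_flops

print(f"cluster:          {cluster_flops / EXA:.1f} FP64 exaflops")
print(f"vs F@H (current): {ratio_vs_current:.0f}x")
print(f"vs F@H (peak):    {ratio_vs_peak:.1f}x")
```

That works out to 12 exaflops for the cluster, roughly 480 times F@H's current throughput and about 5 times its peak, and that is with the cluster measured at the more demanding FP64 precision.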
By the way, I'm not trying to crap on F@H - I think it's an outstanding project and I've run it in the past. But a volunteer group simply cannot compete with well-funded, concentrated effort like what's going into AI.
Their BLOOM model was also a collaborative effort: https://huggingface.co/docs/transformers/en/model_doc/bloom
Also, it wouldn't be able to use a transformer architecture. For inspiration, take a look at Google Maps and how it uses a much more efficient A* divide/conquer hill-climbing architecture. Think minimized matrix math.
https://github.com/NousResearch/DisTrO
There are other gradient compression papers from the past reporting large compression rates.
Can it be parallelized or not?
If you take a model, make two copies, and fine-tune each one on different data, what happens when you merge them? Does it work if you freeze different layers?
I think this works if the steps are small enough. And the transfer should become tenable if the steps are big enough. Where's the cutoff?
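Here is a toy version of that experiment, as a sketch only: two copies of a tiny linear model start from the same weights, each takes a few SGD steps on a different data shard, and the results are merged by plain averaging. The data, step counts and learning rate are all made up. For a convex loss like this one, the merged model is guaranteed to be no worse than the average of the two copies; for a deep network there is no such guarantee, and whether averaging works depends on how far the copies have drifted apart, which is exactly the "where's the cutoff" question.

```python
import random

rng = random.Random(0)  # fixed seed so the toy run is repeatable


def make_shard(lo, hi, n=200):
    """Noisy samples of y = 3x + 1 with x drawn from [lo, hi]."""
    shard = []
    for _ in range(n):
        x = rng.uniform(lo, hi)
        shard.append((x, 3 * x + 1 + rng.gauss(0, 0.1)))
    return shard


def mse(w, data):
    """Mean squared error of the model y = w[0] * x + w[1]."""
    return sum((w[0] * x + w[1] - y) ** 2 for x, y in data) / len(data)


def fine_tune(w, data, steps, lr=0.05):
    """Run a few plain SGD steps on a copy of w, one sample per step."""
    w = list(w)
    for i in range(steps):
        x, y = data[i % len(data)]
        err = w[0] * x + w[1] - y
        w[0] -= lr * 2 * err * x
        w[1] -= lr * 2 * err
    return w


def merge(w_a, w_b):
    """Merge two copies by averaging their weights element-wise."""
    return [(a + b) / 2 for a, b in zip(w_a, w_b)]


shard_a = make_shard(-1.0, 0.0)   # node A only sees negative x
shard_b = make_shard(0.0, 1.0)    # node B only sees positive x
everything = shard_a + shard_b

base = [0.0, 0.0]
w_a = fine_tune(base, shard_a, steps=50)
w_b = fine_tune(base, shard_b, steps=50)
w_merged = merge(w_a, w_b)

for name, w in [("copy A", w_a), ("copy B", w_b), ("merged", w_merged)]:
    print(f"{name}: loss on all data = {mse(w, everything):.4f}")
```

Trying different values of `steps` is a cheap way to get a feel for the trade-off: more local steps means fewer transfers, but the copies specialize harder on their own shard before they get merged.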
At most a decentralized effort could contribute a little bit to some bigger centralized effort by doing inference and sandboxed CPU work. Modern model training isn't just backprop, it's got a huge and growing CPU and inferencing component too, which doesn't require intense inter-node communication. For instance, doing RL rollouts for agentic coding requires a lot of plain old inferencing and sandboxed containers for the models to practice in. The final results are just a set of rollouts and scores that can be uploaded back to a central datacenter for GRPO to adjust the weights (relatively cheap). But then, of course, you'd have to stick to models small enough to fit on people's computers so it'd never be competitive.
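To show how little needs to travel back to the datacenter in that setup, here is a minimal sketch of the scoring step GRPO applies to a group of rollouts for the same prompt: each rollout's advantage is its score relative to the group mean, scaled by the group's spread. The function name and the example scores are hypothetical, and the use of the population standard deviation plus a small epsilon is an assumption, since implementations differ on those details.

```python
from statistics import mean, pstdev


def group_relative_advantages(scores, eps=1e-8):
    """Normalize each rollout's score against its own group (same prompt).

    eps avoids division by zero when every rollout got the same score.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


# Hypothetical scores uploaded by volunteers for four rollouts of one coding task
# (1.0 = tests passed, 0.0 = tests failed).
scores = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(scores)
print(advantages)
```

The expensive part (running the model and the sandbox) happens at the edge, and only the rollouts and their scores need to be uploaded. The weight update that consumes these advantages still has to happen somewhere with the full model and fast interconnect.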
That does mean you are actually neglecting the more difficult issues.
Or is that too close to the plot of The Matrix?
It is already possible: https://arxiv.org/abs/2603.08163 . You don't need to sync so frequently, so it can be done over normal internet, it's just less efficient (takes longer to converge).
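A rough sketch of why syncing less often makes a home connection tolerable. All the numbers below are assumptions picked for illustration (a 1B-parameter model at 2 bytes per parameter, a 25 Mbit/s uplink, 1 second per local step), not figures from the linked paper, and the model ignores compression, overlap of compute with communication, and the extra steps needed to converge.

```python
def comm_fraction(params, bytes_per_param, uplink_bits_per_s,
                  step_seconds, steps_between_syncs):
    """Fraction of wall-clock time spent uploading weights rather than computing."""
    sync_seconds = params * bytes_per_param / (uplink_bits_per_s / 8)
    compute_seconds = step_seconds * steps_between_syncs
    return sync_seconds / (sync_seconds + compute_seconds)


# Hypothetical setup: 1B params, 2 bytes each, 25 Mbit/s uplink, 1 s per local step.
setup = dict(params=1e9, bytes_per_param=2, uplink_bits_per_s=25e6, step_seconds=1.0)

every_step = comm_fraction(steps_between_syncs=1, **setup)
every_1000 = comm_fraction(steps_between_syncs=1000, **setup)

print(f"sync every step:       {every_step:.1%} of time spent uploading")
print(f"sync every 1000 steps: {every_1000:.1%} of time spent uploading")
```

Under these assumptions, syncing every step means the node spends nearly all its time uploading, while syncing every 1000 steps brings that down to around 39%. The cost shows up elsewhere, as slower convergence per step.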
I also didn't bring up the concept out of nowhere; this is in response to an article about open source AI. The premise of the post is releasing control to the public. What is more open than a decentralized system? And why wouldn't you brainstorm in a comment on such a thread?
I also didn't ask an AI for the idea, it's just an idea I have. There's a difference.