I guess this was more related to syncing GPUs.

If you took 500 computers with older GTX 1080 GPUs, you might have compute and RAM equivalent to an H200 GPU for training such a model. Maybe take 10,000.

But if those machines are spread over 10,000 homes on residential internet service, training a large model will not get anywhere.

You go from "data in the same HBM memory chip" at 4.8 TB/s, or "data in an adjacent GPU" over NVLink at 1.2 TB/s, down to 25 Mbit/s of upload speed. Accessing the next piece of data is going to be about a million times slower. At the same time you will generate a thousand times more heat, for a million times longer.
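To put rough numbers on that, here is a back-of-the-envelope sketch using only the bandwidth figures quoted above. The 1 GB payload is a made-up example size, not something from the thread, and all rates are treated as nominal peak values.

```python
# Bandwidth figures as quoted in the comment above (nominal peak rates).
HBM_BYTES_PER_S = 4.8e12      # 4.8 TB/s, data in the same HBM stack
NVLINK_BYTES_PER_S = 1.2e12   # 1.2 TB/s, data in an adjacent GPU (as quoted)
UPLOAD_BITS_PER_S = 25e6      # 25 Mbit/s residential upload

# Convert the upload link from bits to bytes so the units match.
upload_bytes_per_s = UPLOAD_BITS_PER_S / 8


def slowdown(fast_bytes_per_s, slow_bytes_per_s):
    """How many times slower the slow link is than the fast one."""
    return fast_bytes_per_s / slow_bytes_per_s


hbm_slowdown = slowdown(HBM_BYTES_PER_S, upload_bytes_per_s)
nvlink_slowdown = slowdown(NVLINK_BYTES_PER_S, upload_bytes_per_s)

# Hypothetical payload: 1 GB of data to ship to another machine.
PAYLOAD_BYTES = 1e9
seconds_over_upload = PAYLOAD_BYTES / upload_bytes_per_s

print(f"HBM vs upload:    {hbm_slowdown:,.0f}x")
print(f"NVLink vs upload: {nvlink_slowdown:,.0f}x")
print(f"1 GB over upload: {seconds_over_upload:,.0f} s")
```

This only compares bandwidth. Latency over the public internet would widen the gap further, and it is not modeled here.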

by incrudible 11 hours ago

You need to train independently and merge rarely. The problem is the merge step. Weights are too entangled, so you are not going to get an improvement commensurate with the effort. Otherwise, everyone would do it. It is an open research problem.
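For illustration, "train independently, merge rarely" can be sketched as below. This is a toy under loud assumptions: the merge is plain parameter averaging (one possible merge rule, not a method anyone in the thread proposed), the model is a single weight fit to `y = 3x`, and every name here (`make_shard`, `local_sgd`, `merge`, `train`) is hypothetical. Averaging works in this toy because the problem is convex. The entanglement objection above is about deep, non-convex networks, where the average of two good weight sets need not be good.

```python
import random


def make_shard(rng, n=200, true_w=3.0, noise=0.1):
    """Hypothetical private dataset for one worker: y = true_w * x + noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, true_w * x + rng.gauss(0.0, noise)) for x in xs]


def local_sgd(w, shard, rng, lr=0.1, steps=100):
    """Run SGD on squared error using only this worker's local data."""
    for _ in range(steps):
        x, y = rng.choice(shard)
        w -= lr * 2.0 * (w * x - y) * x  # gradient of (w*x - y)**2 w.r.t. w
    return w


def merge(weights):
    """Naive merge rule (an assumption): average the workers' parameters."""
    return sum(weights) / len(weights)


def train(num_workers=4, rounds=5, seed=0):
    rng = random.Random(seed)
    shards = [make_shard(rng) for _ in range(num_workers)]
    w = 0.0
    for _ in range(rounds):
        # Each worker starts from the shared weight and trains in isolation.
        local_weights = [local_sgd(w, shard, rng) for shard in shards]
        # The only communication: one number per worker per round.
        w = merge(local_weights)
    return w


merged_w = train()
print(f"merged weight: {merged_w:.3f} (data was generated with 3.0)")
```

Each round sends one float per worker. For a real model that would be the full parameter set, which is why the number of merge rounds matters so much on a slow link.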

by filup 10 hours ago

That sounds like the way. Everyone trains their own small problems to maximally compressed weights and then merges.

by zozbot234 12 hours ago

The power-constrained part of compute is data movement, not the elementary arithmetic per se. Anyway, it's very possible to tweak the underlying design to increase throughput a lot for any given power budget at the cost of high latency. This seems especially useful for training workloads where we don't really care about latency as much.

by GeoAtreides 8 hours ago

Math is math, but sadly math is neither physics nor engineering.

by pvirgiliu 5 hours ago

math has physics.

by 3 hours ago

deleted