It's more than the raw hardware: it's the interconnect and communication between the hardware at scale. These models are trained on hundreds of thousands of GPUs today. You _will_ start to see cross-datacenter training runs, but those runs need to decide efficiently when and how to communicate across datacenters, which carries a very high cost compared to intra-datacenter communication.
DGX Spark is effectively prosumer hardware: better than most consumer stuff, but still not comparable to actual datacenter gear. You can't judge it on TDP alone; you also have to compare performance.