Agree about the physics; disagree about the larger point.
I am not questioning that packing servers together may give the optimal result for how we currently do things. My point is: what if we did things differently?
<< you cannot get that with distributed training
This is entirely the wrong question to ask. The question to ask is: how could it be adapted to distributed training?