Even with a MoE model, which only needs to move a small fraction of its weights per token, you still end up quite bandwidth-constrained.
It’s workable for mixture-of-experts models, but performance falls off a cliff as soon as the model overflows the GPU and spills into system RAM, and there is a second cliff when the weights have to be fetched from disk on every forward pass.
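The cliffs follow directly from memory bandwidth: at decode time, every token must stream the model's active weights through the compute units, so tokens per second is roughly bandwidth divided by active bytes per token. Here is a back-of-envelope sketch; the 13B-active-parameter model size and the per-tier bandwidth figures are illustrative assumptions, not measurements of any particular system.

```python
# Back-of-envelope decode throughput for a bandwidth-bound MoE model.
# All numbers are illustrative assumptions, not measurements.

GB = 1e9

# Hypothetical MoE: ~13B active parameters per token at 8-bit quantization,
# so each decode step streams roughly 13 GB of weights.
active_bytes_per_token = 13e9

# Assumed ballpark sequential-read bandwidths for each memory tier.
tiers = {
    "GPU HBM (~1000 GB/s)": 1000 * GB,
    "System RAM over PCIe (~25 GB/s)": 25 * GB,
    "NVMe SSD (~5 GB/s)": 5 * GB,
}

for name, bandwidth in tiers.items():
    tokens_per_s = bandwidth / active_bytes_per_token
    print(f"{name}: ~{tokens_per_s:.1f} tokens/s")
```

Under these assumptions the model drops from tens of tokens per second in VRAM to a couple of tokens per second once it spills over PCIe, and to well under one token per second when served from disk, which is the cliff shape described above.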