For inference workloads, it makes a lot more sense to optimize for prefill/ttft before maxing out memory bandwidth.
The software side is still pretty sketchy, too. Apple's ecosystem is fractured between NPU, MPS and Accelerate BLAS, with libraries like MLX and CoreML built precariously overtop. Apple has to commit to a full rearchitecture of their GPU to challenge Nvidia, which fractures that ecosystem even further.
Apple's business model will be to pay Google for compute for now, and then as they get better on device, move more and more locally. So they're very well incentivized to get better. The thing they've been best at in the last 19 years has been spinning flywheels they already have, and this is exactly that.
It's not hard to see why Apple made those mistakes, and many of them were made by the rest of the industry too. It's specifically tragic that Apple snatched defeat from the jaws of victory with GPGPU programming, and it makes me think that their future will be more subscription services and less half-ass technical efforts. Or they rip up the foundation and start from scratch, it's never too late to start work on Apple Silicon 2.
When you layer those same models across 128gb of dGPUs, then you can actually fill the KV cache in seconds, instead of minutes. And you get higher memory bandwidth on most professional cards.