I don't expect them to be AS fast as Nvidia anytime soon. Understood that they need architectural improvements to get there.

Apple's business model will be to pay Google for compute for now, and then as they get better on device, move more and more locally. So they're very well incentivized to get better. The thing they've been best at in the last 19 years has been spinning flywheels they already have, and this is exactly that.

reply
I'm just genuinely convinced that Apple's AI flywheel is going in reverse. They killed their golden goose with OpenCL, which had a genuine shot at dethroning CUDA if Apple had taken it seriously. It had industry-wide buy-in and multiple implementations before Apple threw in the towel. When they designed Apple Silicon, they could have used the lessons learned from that experience to create a CUDA-like ALU layer instead of focusing on raster efficiency for their GPUs. Nvidia had proven that it was possible with low-power ARM SoCs like Jetson and Tegra, which did deliver CUDA in handheld experiences. But Apple chose instead to delegate AI to the NPU, which is now dark silicon on devices that defer to MPS backends for most inference. The architecture is locked into an expensive and suboptimal raster-first GPU design.

It's not hard to see why Apple made those mistakes, and many of them were made by the rest of the industry too. It's specifically tragic that Apple snatched defeat from the jaws of victory with GPGPU programming, and it makes me think that their future will be more subscription services and fewer half-assed technical efforts. Or they rip up the foundation and start from scratch; it's never too late to start work on Apple Silicon 2.

reply
I think it's easy to understand why Apple wouldn't build low-level engineering solutions: they'd rather control the platform and just have developers call MLX. I'm not sure, if I were in their shoes, that I'd make the same call. But it's a call, and it's consistent with the rest of their ecosystem decisions.
reply
I love those 128 GB dGPUs.
reply
Me too! The problem is that people don't love having 128 GB of DDR5 held back by a laptop-grade iGPU. It puts up strictly non-interactive speeds for LLMs of that size.

When you layer those same models across 128 GB of dGPUs, you can actually fill the KV cache in seconds instead of minutes. And you get higher memory bandwidth on most professional cards.
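
For anyone who wants to sanity-check the "seconds instead of minutes" claim, here's a rough back-of-envelope sketch. It assumes a dense model at batch size 1, where prompt processing (filling the KV cache) is compute-bound at about 2 FLOPs per parameter per prompt token, and token generation is bandwidth-bound because every weight is read once per token. The function names and every number below are illustrative placeholders I made up, not measurements of any real chip or card, and the model ignores interconnect overhead when a model is split across several GPUs.

```python
def prefill_seconds(n_params, prompt_tokens, flops_per_s):
    """Rough time to process the prompt and fill the KV cache (compute-bound)."""
    # Assumption: ~2 FLOPs per parameter per prompt token for a dense forward pass.
    return 2 * n_params * prompt_tokens / flops_per_s


def decode_tokens_per_second(weight_bytes, mem_bandwidth_bytes_per_s):
    """Rough upper bound on generation speed (memory-bandwidth-bound)."""
    # Assumption: every weight byte is read once per generated token.
    return mem_bandwidth_bytes_per_s / weight_bytes


# Hypothetical workload: 70B parameters stored at 1 byte each, 32k-token prompt.
N_PARAMS = 70e9
WEIGHT_BYTES = 70e9
PROMPT_TOKENS = 32_000

# Hypothetical hardware profiles: (usable FLOP/s, memory bandwidth in bytes/s).
profiles = {
    "igpu_like": (50e12, 250e9),
    "dgpu_like": (500e12, 1000e9),
}

for name, (flops, bandwidth) in profiles.items():
    prefill = prefill_seconds(N_PARAMS, PROMPT_TOKENS, flops)
    decode = decode_tokens_per_second(WEIGHT_BYTES, bandwidth)
    print(f"{name}: prefill ~{prefill:.1f} s, decode ~{decode:.1f} tok/s")
```

Swap in the real spec-sheet numbers for whatever hardware you're comparing. The point is just that prefill scales with compute and decode scales with bandwidth, so a big pool of memory doesn't help much if both of those are low.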
