Memory bandwidth is the bottleneck in the Spark. If you replace the SoC with an optimized ASIC but keep the same 256-bit LPDDR5, performance will be the same. You can increase performance with a wider memory bus, but that's also more expensive.
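A rough way to see why the compute die doesn't matter here: during decode, each generated token has to stream roughly all of the model weights from memory once, so bandwidth divided by weight size gives a ceiling on tokens per second. The sketch below assumes that simplified model (no batching, no caching effects), and both numbers in it are illustrative assumptions, not figures from this thread.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed when every generated token
    has to stream all model weights from memory once."""
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / weights_gb


# Assumed, illustrative numbers (not from the thread):
ASSUMED_BANDWIDTH_GB_S = 273.0  # hypothetical figure for a 256-bit LPDDR5-class bus
ASSUMED_WEIGHTS_GB = 70.0       # hypothetical quantized model size

same_bus = decode_ceiling_tok_s(ASSUMED_BANDWIDTH_GB_S, ASSUMED_WEIGHTS_GB)
wider_bus = decode_ceiling_tok_s(2 * ASSUMED_BANDWIDTH_GB_S, ASSUMED_WEIGHTS_GB)
print(f"same bus: {same_bus:.1f} tok/s ceiling")
print(f"2x wider: {wider_bus:.1f} tok/s ceiling")
```

Under this model a faster chip on the same bus leaves the ceiling unchanged, while doubling the bus width doubles it.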

by phonon 1 day ago

M3 Ultra has a 1024-bit memory bus (819 GB/s) and starts at $3,999 (96 GB of RAM). It can be done...
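For reference, the 819 GB/s figure falls out of the bus width if you assume a 6400 MT/s transfer rate. That rate is an assumption here, not something stated above. A minimal sketch of the arithmetic:

```python
def peak_bandwidth_gb_s(bus_width_bits: int, transfer_rate_mt_s: float) -> float:
    """Peak DRAM bandwidth in GB/s: bytes moved per transfer times
    millions of transfers per second, scaled from MB/s to GB/s."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mt_s / 1000


ASSUMED_RATE_MT_S = 6400  # assumed transfer rate, not given in the thread

wide = peak_bandwidth_gb_s(1024, ASSUMED_RATE_MT_S)
narrow = peak_bandwidth_gb_s(256, ASSUMED_RATE_MT_S)
print(f"1024-bit: {wide:.1f} GB/s")
print(f" 256-bit: {narrow:.1f} GB/s")
```

At the same transfer rate, a 256-bit bus gets a quarter of the bandwidth of a 1024-bit one, which is the gap being argued about.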

by bigyabai 1 day ago

The tradeoff is that the M3 Ultra's GPU loses to laptop GPUs in compute benchmarks. All of that bandwidth is wasted idling for token prefill.

For inference workloads, it makes a lot more sense to optimize for prefill/TTFT (time to first token) before maxing out memory bandwidth.
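To make the prefill-versus-decode split concrete: prefill is roughly compute-bound (about 2 FLOPs per parameter per prompt token, ignoring attention), while decode is roughly bandwidth-bound. The sketch below uses that common approximation. The function names and every number in it are hypothetical, chosen only to show how the two phases scale.

```python
def prefill_seconds(params_billion: float, prompt_tokens: int, achieved_tflops: float) -> float:
    """Approximate time to first token when prefill is compute-bound.
    Uses the rough rule of 2 FLOPs per parameter per prompt token."""
    flops = 2 * params_billion * 1e9 * prompt_tokens
    return flops / (achieved_tflops * 1e12)


def decode_seconds_per_token(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Approximate per-token latency when decode is bandwidth-bound:
    each token streams all weights from memory once."""
    return weights_gb / bandwidth_gb_s


# Hypothetical machine A: lots of bandwidth, modest compute.
# Hypothetical machine B: less bandwidth, much more compute.
ttft_a = prefill_seconds(params_billion=70, prompt_tokens=8000, achieved_tflops=20)
ttft_b = prefill_seconds(params_billion=70, prompt_tokens=8000, achieved_tflops=200)
tok_a = decode_seconds_per_token(weights_gb=40, bandwidth_gb_s=800)
tok_b = decode_seconds_per_token(weights_gb=40, bandwidth_gb_s=400)

print(f"A: TTFT {ttft_a:.1f} s, then {1 / tok_a:.0f} tok/s")
print(f"B: TTFT {ttft_b:.1f} s, then {1 / tok_b:.0f} tok/s")
```

With a long prompt, the machine with more compute starts answering far sooner even though it decodes more slowly, which is the tradeoff described above.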

by Schiendelman 19 hours ago

With the M6 theoretically coming later this year, Apple seems to be realizing they need to catch up with more lanes of GPU.

by bigyabai 19 hours ago

Personally, I doubt it. Apple hamstrung themselves with unified SoC memory; there are cheap dGPUs that smoke the M5's prefill speeds and even have faster decode too. Apple is running up against the limitations of putting a mobile integrated chipset up against the desktop form factor. An SoC stops looking like a smart decision at that scale.

The software side is still pretty sketchy, too. Apple's ecosystem is fractured between NPU, MPS and Accelerate BLAS, with libraries like MLX and CoreML built precariously on top. Apple has to commit to a full rearchitecture of their GPU to challenge Nvidia, which fractures that ecosystem even further.

by Schiendelman 17 hours ago

I don't expect them to be AS fast as Nvidia anytime soon. Understood that they need architectural improvements to get there.

Apple's business model will be to pay Google for compute for now, and then as they get better on device, move more and more locally. So they're very well incentivized to get better. The thing they've been best at in the last 19 years has been spinning flywheels they already have, and this is exactly that.

by bigyabai 16 hours ago

I'm just genuinely convinced that Apple's AI flywheel is going in reverse. They killed their golden goose with OpenCL, which had a genuine shot at dethroning CUDA if Apple took it seriously. It had industry-wide buy-in and multiple implementations before Apple threw in the towel. When they designed Apple Silicon, they could have used the lessons learned from that experience to create a CUDA-like ALU layer instead of focusing on raster efficiency for their GPUs. Nvidia had proven that it was possible with low-power ARM SoCs like Jetson and Tegra, which did deliver CUDA in handheld experiences. But Apple chose instead to delegate AI to the NPU, which is now dark silicon on devices that defer to MPS backends for most inference. The architecture is locked into an expensive and suboptimal raster-first GPU design.

It's not hard to see why Apple made those mistakes, and many of them were made by the rest of the industry too. It's specifically tragic that Apple snatched defeat from the jaws of victory with GPGPU programming, and it makes me think that their future will be more subscription services and fewer half-assed technical efforts. Or they rip up the foundation and start from scratch. It's never too late to start work on Apple Silicon 2.

by Schiendelman 16 hours ago

I think it's easy to understand why Apple wouldn't build low-level engineering solutions: they'd rather control the platform and just have developers call MLX. I'm not sure, if I were in their shoes, that I'd make the same call. But it's a call, and it's consistent with the rest of their ecosystem decisions.

by wmf 17 hours ago

I love those 128 GB dGPUs.

by bigyabai 17 hours ago

Me too! The problem is that people don't love having 128 GB of DDR5 held back by a laptop-grade iGPU. It puts up strictly non-interactive speeds for LLMs of that size.

When you layer those same models across 128 GB of dGPUs, you can actually fill the KV cache in seconds instead of minutes. And you get higher memory bandwidth on most professional cards.

by smith7018 1 day ago

Unfortunately, Sam Altman won't be the one to deliver us at-home hardware that can run Opus-level models.

by blitzar 22 hours ago

I wonder what is happening with the OpenAI / Jony Ive crossover episode.

by flyinglizard 1 day ago

Forget about it. Datacenter class hardware is getting farther and farther from desktop use. It’s not PCIe GPUs anymore.