undefined

points

[-]

I'm curious and not an expert here, do you know why the TTFT is so much worse on Mac? To elaborate, the article just says that this step is compute bound, but I'm wondering whether it is just that simple or if it might also be less optimised in MLX?

by Aurornis18 hours ago|

parent|

[-]

Prefill (prompt processing) is compute bound doing large matrix operations. Token generation (aka tokens/s) is memory bandwidth bound.

The RTX 5090 has an incredible amount of compute performance for matrix operations and a lot of memory bandwidth. The Apple Silicon parts have unusually high memory bandwidth for general purpose compute chips, which is why they can generate tokens so fast. Their raw matrix compute performance is amazing for their power envelope but not nearly as fast as a dedicated GPU consuming 400-500W.

Apple added tensor cores on the M5 generation which help with those matrix operations, which is why the M5 performs so much better than the M4 Max in that article.

Dedicate GPUs like the RTX 5090 are in another league, though.

You can see the divergence in the high resolution gaming benchmarks, too. Once he starts benchmarking at 4K or 6K where the CPU emulation stops being a bottleneck, the raw compute of the 5090 completely crushes any of the Apple Silicon GPUs.

by ademeure18 hours ago|

parent|

prev|

[-]

Apple GPUs didn’t have tensor cores until the M5 (aka “a neural accelerator in each core”) and in the article’s charts that a M5 Pro significantly beats a M4 Max (while in other workloads it would be much smaller since Pro is ~1/2 Max).

EDIT: since Aurornis beat me by 3 minutes, I’ll add another interesting tidbit instead :)

NVIDIA tensor cores on consumer GPUs are massively less powerful per SM core than on their datacenter counterparts-parts (which also makes them easier to get to peak efficiency on consumer GPUs because the rest of the pipeline is much more quickly a bottleneck as per Amdahl’s Law).

This is potentially changing with Vera Rubin CPX which looks an awful lot like a RTX 5090 replacement but with the full-blown datacenter tensor cores (that won’t be available unless you pay for the datacenter SKU) - so it will have very high TFLOPS relative to its bandwidth.

The target market for the CPX is exactly this: prefill and Time To First Token. You can basically just throw compute at the problem for (parts of) prefill performance (but it won’t help anything else past a certain point) and the 5090/M5 are nowhere near that limit.

So the design choice for NVIDIA/Apple/etc of how much silicon to spend for this on consumer GPUs is mostly dictated by economics and how much they can reuse the same chips for the different markets.

by Melatonic14 hours ago|

parent|

[-]

Does that include stuff like the Pro Blackwell 6000? Or are the tensor cores as good per SM comparably? They perform quite well on many tests

by aviinuo14 hours ago|

parent|

[-]

Pro Blackwell 6000 is just a 5090 with more VRAM. It does not have the tcgen05 (5th gen tensor core) instructions despite the "5th gen tensor core) branding and thus do not support any optimized Blackwell (sm100) kernels.

Every Blackwell card other than the (G)B100, (G)B200, (G)B300 and Jetson Thor, use the Ampere tensor core instruction (mma.sync) but with fp4/6/8 added on. Beyond that the DGX Spark (which is advertised as having the same architecture as B200) has especially weak (not tcgen05) tensor cores that have a very narrow operating window and low utilization.

by mathisfun12318 hours ago|

parent|

prev|

[-]

> I'm curious and not an expert here, do you know why the TTFT is so much worse on Mac?

because the GPUs aren't as fantastic as everyone assumes?

> might also be less optimised in MLX?

prefill has gotta be one of the most optimized paths in MLX...

by bigyabai15 hours ago|

parent|

[-]

No you don't understand, on Apple Silicon my CPU has comparable memory bandwidth to a $400 Pascal-era GPU. With the unified memory architecture, that means my iGPU gets 2016-levels of DDR transfer speed with none of the upsides of CUDA. It's the most cutting-edge hardware ever put in a personal computer, without a doubt.

by fgfarben3 hours ago|

parent|

[-]

Please show me on the 2016-era $400 Pascal GPU where you can install the 256 GB of VRAM.

by Moosdijk17 hours ago|

prev|

[-]

It feels pedantic to point it out, but it’s actually 113x faster.

Seeing the author present their results like this give off the impression that they’re biased, which I am sure they aren’t.

by scottjg16 hours ago|

parent|

[-]