The RTX 5090 has an enormous amount of matrix compute and a lot of memory bandwidth. The Apple Silicon parts have unusually high memory bandwidth for general-purpose compute chips, which is why they can generate tokens so fast. Their raw matrix compute is impressive for their power envelope, but nowhere near a dedicated GPU drawing 400-500W.
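A quick back-of-the-envelope for why bandwidth dominates token generation: each decoded token has to stream essentially all the model weights from memory once, so bandwidth divided by model size gives a rough ceiling on tokens/sec. A minimal sketch (the bandwidth figures and model size are ballpark assumptions, not benchmarks):

```python
# Rough decode-speed ceiling: token generation is memory-bandwidth bound,
# since every generated token reads (roughly) all model weights once.
# All numbers below are illustrative assumptions, not measurements.

def decode_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/sec: bandwidth over bytes read per token."""
    return bandwidth_gb_s / model_gb

# Assumed: an 8B-parameter model quantized to ~4 bits -> roughly 4.5 GB.
model_gb = 4.5

for name, bw in [("M4 Max (~546 GB/s)", 546), ("RTX 5090 (~1792 GB/s)", 1792)]:
    print(f"{name}: ~{decode_tokens_per_sec(bw, model_gb):.0f} tok/s ceiling")
```

Real decode speeds land below these ceilings, but the ratio between the two chips tracks the bandwidth ratio, not the compute ratio.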
Apple added tensor cores in the M5 generation, which accelerate exactly those matrix operations; that's why the M5 performs so much better than the M4 Max in that article.
Dedicated GPUs like the RTX 5090 are in another league, though.
You can see the divergence in the high-resolution gaming benchmarks, too. Once he starts benchmarking at 4K or 6K, where CPU-side emulation overhead stops being the bottleneck, the raw compute of the 5090 completely crushes any of the Apple Silicon GPUs.
EDIT: since Aurornis beat me by 3 minutes, I’ll add another interesting tidbit instead :)
NVIDIA tensor cores on consumer GPUs are massively less powerful per SM than on their datacenter counterparts (which also makes them easier to drive at peak efficiency on consumer GPUs, because the rest of the pipeline becomes the bottleneck much sooner, per Amdahl's Law).
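To make the Amdahl's Law point concrete: if only a fraction of the runtime is tensor-core work, making those cores faster runs into a hard ceiling. A minimal sketch, where the 60% tensor-core fraction is an assumed value for illustration, not a measured one:

```python
# Amdahl's Law: if fraction p of runtime is tensor-core work and that part
# gets s times faster, overall speedup is 1 / ((1 - p) + p / s).

def amdahl_speedup(p: float, s: float) -> float:
    return 1.0 / ((1.0 - p) + p / s)

p = 0.6  # assumed: 60% of the pipeline is tensor-core math
for s in (2, 4, 8, 1e9):
    print(f"tensor cores {s:g}x faster -> overall {amdahl_speedup(p, s):.2f}x")

# Even infinitely fast tensor cores cap out at 1 / (1 - p) = 2.5x here,
# which is why weaker consumer tensor cores are easier to saturate.
```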
This is potentially changing with Vera Rubin CPX, which looks an awful lot like an RTX 5090 replacement but with the full-blown datacenter tensor cores (which until now you couldn't get without paying for the datacenter SKU) - so it will have very high TFLOPS relative to its bandwidth.
The target market for the CPX is exactly this: prefill and time to first token. You can basically just throw compute at (parts of) prefill performance (though past a certain point it won't help anything else), and the 5090/M5 are nowhere near that limit.
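To see why prefill rewards raw compute: running a T-token prompt through an N-parameter model costs roughly 2·N·T FLOPs, and all T tokens can be processed in parallel, so time to first token scales with achievable FLOPS rather than bandwidth. A rough sketch (model size, prompt length, and throughput figures are illustrative assumptions, and it assumes you actually sustain those FLOPS):

```python
# Prefill cost model: ~2 * N * T FLOPs for an N-parameter model over a
# T-token prompt. All figures below are assumptions for illustration.

def prefill_seconds(n_params: float, prompt_tokens: int, flops: float) -> float:
    return 2 * n_params * prompt_tokens / flops

n_params = 8e9    # assumed 8B-parameter model
prompt = 32_000   # assumed long prompt

for name, tflops in [("~100 TFLOPS part", 100e12), ("~800 TFLOPS part", 800e12)]:
    print(f"{name}: TTFT ~{prefill_seconds(n_params, prompt, tflops):.2f} s")
```

Unlike decode, there's no bandwidth ceiling in the way here for realistic prompt lengths, so a high-TFLOPS/low-bandwidth part like the CPX makes sense for this slice of the workload.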
So the design choice for NVIDIA/Apple/etc. of how much silicon to spend on this in consumer chips is mostly dictated by economics and by how much they can reuse the same chips across different markets.
Every Blackwell card other than the (G)B100, (G)B200, (G)B300, and Jetson Thor uses the Ampere tensor core instruction (mma.sync), but with fp4/6/8 added on. Beyond that, the DGX Spark (which is advertised as having the same architecture as the B200) has especially weak (non-tcgen05) tensor cores with a very narrow operating window and low utilization.
Because the GPUs aren't as fantastic as everyone assumes?
> might also be less optimised in MLX?
prefill has gotta be one of the most optimized paths in MLX...
Seeing the author present their results like this gives off the impression that they're biased, which I am sure they aren't.