On the M5 Pro/Max, the memory is attached directly to the GPU die; the CPU accesses it through the die-to-die bridge. From a memory-connectivity point of view, I don't see the difference between that and a pure GPU.

Wrt inference servers: sure, it's not cost-effective to carry such a huge CPU die and a bunch of media accelerators alongside the GPU die if all you care about is raw compute for inference and training. Apple SoCs aren't tuned for that market, and Apple doesn't sell into it. But I'm not building a datacentre; I'm trying to run inference on home hardware that I also want to use for other things.

reply
Industrial Scale Inference is moving towards LPDDR memory (alongside HBM), which is essentially what "Unified Memory" is.
reply
> which is essentially what "Unified Memory" is.

Unified memory means the CPU and GPU can reference the same memory address without data being copied. (CUDA lets you write code as if memory were unified even when it isn't, so that alone doesn't count, but HMM does count[1].)

That is all. The technology underneath is a hardware detail. Unified memory on Macs lets you put something into memory, then compute on it with the CPU, the ANE, or Metal shaders, all without copying anything.

DGX Spark also has unified memory.
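To make the "same address, no copies" definition concrete, here's a minimal CUDA sketch using `cudaMallocManaged`, which hands back a single pointer that both the CPU and GPU dereference directly (on hardware without true unified memory the driver migrates pages behind the scenes, which is exactly the "as if it was unified" case mentioned above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// GPU kernel: increments each element in place through the shared pointer
__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 4;
    int *data;
    // One allocation, one pointer, valid on both CPU and GPU
    cudaMallocManaged(&data, n * sizeof(int));

    for (int i = 0; i < n; i++) data[i] = i;  // CPU writes
    increment<<<1, n>>>(data, n);             // GPU reads/writes the same pointer
    cudaDeviceSynchronize();                  // wait for the kernel to finish

    for (int i = 0; i < n; i++) printf("%d ", data[i]);  // CPU reads the results
    cudaFree(data);
    return 0;
}
```

No explicit `cudaMemcpy` appears anywhere; whether the bytes physically move is up to the hardware and driver, which is the "hardware detail" point above.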

[1]: https://docs.nvidia.com/cuda/cuda-programming-guide/02-basic...

reply
LPDDR is LPDDR. There's nothing "unified" about it architecturally.
reply
Unified memory is mainly how consumer hardware gets enough GPU-accessible RAM to run larger models; otherwise, market segmentation jacks up the price substantially.
reply
UMA removes the PCIe bottleneck and replaces it with a memory controller + bandwidth bottleneck. For most high-performance GPUs, that would be a direct downgrade.
reply