undefined

points

[-]

> None of the things people care about really get much out of "unified memory". GPUs need a lot of memory bandwidth, but CPUs generally don't and it's rare to find something which is memory bandwidth bound on a CPU that doesn't run better on a GPU to begin with. Not having to copy data between the CPU and GPU is nice on paper but again there isn't much in the way of workloads where that was a significant bottleneck.

the bottleneck in lots of database workloads is memory bandwidth. for example, hash join performance with a build side table that doesn't fit in L2 cache. if you analyze this workload with perf, assuming you have a well written hash join implementation, you will see something like 0.1 instructions per cycle, and the memory bandwidth will be completely maxed out.

similarly, while there have been some attempts at GPU accelerated databases, they have mostly failed exactly because the cost of moving data from the CPU to the GPU is too high to be worth it.

i wish aws and the other cloud providers would offer arm servers with apple m-series levels of memory bandwidth per core, it would be a game changer for analytical databases. i also wish they would offer local NVMe drives with reasonable bandwidth - the current offerings are terrible (https://databasearchitects.blogspot.com/2024/02/ssds-have-be...)

by AnthonyMouse3 hours ago|

parent|

[-]

> the bottleneck in lots of database workloads is memory bandwidth.

It can be depending on the operation and the system, but database workloads also tend to run on servers that have significantly more memory bandwidth:

> i wish aws and the other cloud providers would offer arm servers with apple m-series levels of memory bandwidth per core, it would be a game changer for analytical databases.

There are x64 systems with that. Socket SP5 (Epyc) has ~600GB/s per socket and allows two-socket systems, Intel has systems with up to 8 sockets. Apple Silicon maxes out at ~800GB/s (M3 Ultra) with 28-32 cores (20-24 P-cores) and one "socket". If you drop a pair of 8-core CPUs in a dual socket x64 system you would have ~1200GB/s and 16 cores (if you're trying to maximize memory bandwidth per core).

The "problem" is that system would take up the same amount of rack space as the same system configured with 128-core CPUs or similar, so most of the cloud providers will use the higher core count systems for virtual servers, and then they have the same memory bandwidth per socket and correspondingly less per core. You could probably find one that offers the thing you want if you look around (maybe Hetzner dedicated servers?) but you can expect it to be more expensive per core for the same reason.

by NetMageSCW7 hours ago|

prev|

[-]

>The sad part is that DDR5 is the thing that doesn't need to be soldered, unlike GDDR. But then Apple solders it anyway.

Apple needs to solder it because they are attaching it directly to the SOC to minimize lead length and that is part of how they are able to get that bandwidth.

by AnthonyMouse3 hours ago|

parent|

[-]

Systems with socketed RAM have had on-die memory controllers for more than two decades. CAMM2 supports the same speeds as Apple is using in the M5.

by nsteel6 hours ago|

prev|

[-]

Except they don't use DDR5. LPDDR5 is always soldered. LPDDR5 requires short point-to-point connections to give you good SI at high speeds and low voltages. To get the same with DDR5 DIMMs, you'd have something physically much bigger, with way worse SI, with higher power, and with higher latency. That would be a much worse solution. GDDR is much higher power, the solution would end up bigger. Plus it's useless for system memory so now you need two memory types. LPDDR5 is the only sensible choice.

by AnthonyMouse3 hours ago|

parent|

[-]

> LPDDR5 is always soldered.

No it isn't:

https://www.newegg.com/crucial-32gb-ddr5-7500-cas-latency-cl...

CAMM2 is new and most of the PC companies aren't using it yet but it's exactly the sort of thing Apple used to be an early adopter of when they wanted to be.

by NetMageSCW2 hours ago|

parent|

[-]

It looks like LPCAMM2 is shipping from one vendor and only started shipping in October- that’s a bit quick and early for Apple to adopt.

by Melatonic5 hours ago|

parent|

prev|

[-]

Is it really useless for system memory or is it just too expensive and no manufacturer has bothered?

by actionfromafar12 hours ago|

prev|

[-]

> Not having to copy data between the CPU and GPU is nice on paper but again there isn't much in the way of workloads where that was a significant bottleneck.

Isn't that also because that's world we have optimized workloads for?

If the common hardware had unified memory, software would have exploited that I imagine. Hardware and software is in a co-evolutionary loop.

by AnthonyMouse1 hours ago|

parent|

[-]

Sort of?

Part of the problem is that there is actually a reason for the distinction, because GPUs need faster memory but faster memory is more expensive, so then it makes sense to have e.g. 8GB of GDDR for the GPU and 32GB of DDR for the CPU, because that costs way less than 40GB of GDDR. So there is an incentive for many systems to exist that do it that way, and therefore a disincentive to write anything that assumes copying between them is free because it would run like trash on too large a proportion of systems even if some large plurality of them had unified memory.

A sensible way of doing this is to use a cache hierarchy. You put e.g. 8GB of expensive GDDR/HBM on the APU package (which can still be upgraded by replacing the APU) and then 32GB of less expensive DDR in slots on the system board. Then you have "unified memory" without needing to buy 40GB of GDDR. The first 8GB is faster and the CPU and GPU both have access to both. It's kind of surprising that this configuration isn't more common. Probably the main thing you'd need is for the APU to have a direct power connector like a GPU so you're not trying to deliver most of a kilowatt through the socket in high end configurations, but that doesn't explain why e.g. there is no 65W CPU + 100W GPU with a bit of GDDR to be put in the existing 170W AM5 socket.

However, even if that was everywhere, it's still doesn't necessarily imply there are a lot of things that could do much with it. You would need something that simultaneously requires more single-thread performance than you can get from a GPU, more parallel computation than you can get from a high-end CPU, and requires a large amount of data to be repeatedly shared between those subsets of the computation. Such things probably exist but it's not obvious that they're very common.