
It does not, for any of the dual-CCD parts AMD has ever released for consumers. Even Strix Halo, which has a higher-bandwidth, lower-latency interconnect, doesn't present a single L3 across CCDs.

It'll probably only happen when they have a singular, large die filled with cache upon which both CCDs are stacked.

Run this test if you're curious: https://github.com/ChipsandCheese/MemoryLatencyTest

On a regular CCD (test size in KB, latency in ns):

32768,46.115

65536,74.243

98304,85.699

131072,91.42

262144,99.402

On a 3D cache CCD (test size in KB, latency in ns):

32768,11.992

65536,12.712

98304,29.921

131072,49.91

262144,86.059
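To make the gap easier to read, here is a minimal Python sketch that parses the two result sets above and computes the latency ratio at each test size. It assumes the first column is the test size in KB and the second is latency in ns; `parse_results` and `latency_ratio` are made-up helper names, not part of the linked tool.

```python
# Results copied from the two runs above (assumed format: size_kb,latency_ns).
REGULAR_CCD = """\
32768,46.115
65536,74.243
98304,85.699
131072,91.42
262144,99.402
"""

VCACHE_CCD = """\
32768,11.992
65536,12.712
98304,29.921
131072,49.91
262144,86.059
"""


def parse_results(text):
    """Parse 'size,latency' lines into a {size: latency} dict."""
    results = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        size, latency = line.split(",")
        results[int(size)] = float(latency)
    return results


def latency_ratio(regular, vcache):
    """Return {size: regular / vcache} for sizes present in both runs."""
    return {size: regular[size] / vcache[size]
            for size in sorted(regular) if size in vcache}


if __name__ == "__main__":
    ratios = latency_ratio(parse_results(REGULAR_CCD), parse_results(VCACHE_CCD))
    for size_kb, ratio in ratios.items():
        print(f"{size_kb // 1024:>4} MB: regular CCD latency is {ratio:.2f}x the 3D cache CCD's")
```

The ratio is largest where the test size fits in the larger L3 but not the smaller one, and it approaches 1 once the test size exceeds both.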

by phire 10 hours ago

The short answer is that L3 is local to each CCD.

And that answer is good enough for most workloads. You should stop reading now.

_______________________

The complex answer is that there is some ability for one CCD to pull cachelines from the other CCD. But I've never been able to find a solid answer on the limitations of this. I know it can pull a dirty cacheline from the L1/L2 of another CCD (this is the core-to-core latency test you often see in benchmarks, and there is an obvious cross-die latency hit).

But I'm not sure it can pull a clean cacheline from another CCD at all, or if those just get redirected to main memory (as the latency to main memory isn't that much higher than between CCDs). And even if it can pull a clean cacheline, I'm not sure it can pull them from another CCD's L3 (which is an eviction cache, so only holds clean cachelines).

The only way for a cacheline to get into a CCD's L3 is to be evicted from an L2 on that CCD, so if a dataset is active across both CCDs, it will end up duplicated across both L3s. Cachelines evicted from one L3 do NOT end up in another L3, so an idle CCD can't act as a pseudo L4.

I haven't seen anyone make a benchmark which would show the effect, if it exists.
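This isn't that benchmark, but the per-CCD L3 layout itself is easy to check. A rough Python sketch, assuming Linux and its sysfs cache directories (`/sys/devices/system/cpu/cpuN/cache/indexM/` with `level` and `shared_cpu_list` files); the helper names `parse_cpu_list` and `l3_domains` are hypothetical:

```python
import glob
import os


def parse_cpu_list(text):
    """Expand a Linux CPU list such as '0-7,16-23' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def l3_domains(sysfs_root="/sys/devices/system/cpu"):
    """Group logical CPUs by the L3 cache they share.

    Assumes the Linux sysfs cache layout. Returns a sorted list of CPU
    lists, one per distinct L3. Unreadable entries are skipped.
    """
    domains = set()
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cache", "index*")
    for index_dir in glob.glob(pattern):
        try:
            with open(os.path.join(index_dir, "level")) as f:
                if f.read().strip() != "3":
                    continue  # not an L3 entry
            with open(os.path.join(index_dir, "shared_cpu_list")) as f:
                domains.add(tuple(parse_cpu_list(f.read())))
        except OSError:
            continue
    return sorted(list(d) for d in domains)


if __name__ == "__main__":
    for i, cpus in enumerate(l3_domains()):
        print(f"L3 domain {i}: CPUs {cpus}")
```

If L3 really is local to each CCD, a dual-CCD part should report two separate domains here rather than one covering every CPU.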

by undersuit 10 hours ago

AMD didn't have to introduce a special driver for the Ryzen 9 5950X to keep threads resident on the "gaming" CCD. In workloads that didn't use more than 8 cores, there was only a small difference between the 5950X and the non-X3D Ryzen 7 5800X, unlike the slowdowns observed on the Ryzen 9 7950X3D and 7900X3D at release compared to the Ryzen 7 7800X3D.

When the L3 sizes differ across CCDs, the special AMD driver is needed to keep threads pinned to the larger-L3 CCD and prevent them from being placed on the smaller-L3 CCD, where their memory requests can exploit the other CCD's L3 as an L4. The AMD driver reduces CCD-to-CCD data requests by keeping programs contained in one CCD.
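For anyone who wants to try the "keep it on one CCD" idea by hand, here is a rough Linux-only Python sketch using `os.sched_setaffinity`. It is not what the AMD driver does internally, just a manual analogue; `pin_to_cpus` is a made-up helper, and which logical CPUs belong to which CCD is an assumption you have to check on your own machine.

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict a process (pid 0 = the caller) to the given logical CPUs.

    Only CPUs the process is already allowed to use are kept, so a guessed
    CCD layout that does not match this machine raises an error instead of
    silently pinning to nothing. Returns the CPUs actually used.
    """
    allowed = os.sched_getaffinity(pid)
    target = set(cpus) & allowed
    if not target:
        raise ValueError(f"none of {sorted(cpus)} are available to pid {pid}")
    os.sched_setaffinity(pid, target)
    return sorted(target)
```

For example, `pin_to_cpus(range(0, 8))` would confine the current process to logical CPUs 0-7, which is only the first CCD if your system happens to number them that way.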

With equal L3 caches, when a process spills onto the second CCD it will still use the first's L3 cache as an "L4", but it no longer has to evict that data at the same rate as on the lopsided models. Additionally, the first CCD can use the second CCD's L3 in kind, reducing the number of requests that need to go to main memory.

The same-sized L3s reduce contention on the IO die and the larger L3s reduce memory contention; it's a win-win.

https://www.phoronix.com/review/amd-3d-vcache-optimizer-9950...