Intel suffers just as much when NUMA enters the picture, even before CCD-style architectures. The extra latency hop to reach memory attached to another socket is absolutely crippling, especially in a hot loop. It requires very careful handling, and it's an invisible problem: unless you know to look for it, nothing will draw your attention to it.
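For anyone who wants to "know to look for it": on Linux the kernel exposes the NUMA layout under sysfs, so a quick check is cheap. Below is a rough sketch in Python, assuming a Linux box with `/sys/devices/system/node` present; the function names `parse_cpulist` and `numa_nodes` are my own for illustration, not from any library.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-11' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty string: e.g. a memory-only node with no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]}. Returns {} if the sysfs tree is absent
    (non-Linux systems, or kernels built without NUMA support)."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        match = re.search(r"node(\d+)$", path)
        if match is None:
            continue
        try:
            with open(os.path.join(path, "cpulist")) as f:
                nodes[int(match.group(1))] = parse_cpulist(f.read())
        except OSError:
            continue  # node directory without a readable cpulist
    return nodes


if __name__ == "__main__":
    for node_id, cpus in sorted(numa_nodes().items()):
        print(f"node {node_id}: {len(cpus)} CPUs -> {cpus}")
```

More than one node in the output means cross-node memory access is possible, and pinning the hot loop's threads to a single node (for example with `os.sched_setaffinity` or `numactl --cpunodebind`/`--membind`) is worth trying before anything fancier.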
Hundreds of cores likely means two sockets, so you've got NUMA there.

Scaling to large core counts has a lot of gotchas.
