undefined

points

[-]

ARM Cortex-X925 achieves indeed a very good IPC, but it has competitive performance only in general-purpose applications that cannot benefit from using array operations (i.e. the vector instructions and registers). The results shown in the parent article for the integer tests of SPEC CPU2017 are probably representative for Cortex-X925 when running this kind of applications.

While the parent article shows AMD Zen 5 having significantly better results in floating-point SPEC CPU2017, these benchmark results are still misleading, because in properly optimized for AVX-512 applications the difference between Zen 5 and Cortex-X925 would be much greater. I have no idea how SPEC has been compiled by the author of the article, but the floating-point results are not consistent with programs optimized for Zen 5.

One disadvantage of Cortex-X925 is having narrower vector instructions and registers, which requires more instructions for the same task and it is only partially compensated by the fact that Cortex-X925 can execute up to 6 128-bit instructions per clock cycle (vs. up to 4 vector instructions per clock cycle for Intel/AMD, but which are wider, 256-bit for Intel and up to 512-bit for Zen 5). This has been shown in the parent article.

The second disadvantage of Cortex-X925 is that it has an unbalanced microarchitecture for vector operations. For decades most CPUs with good vector performance had an equal throughput for fused multiply-add operations and for loads from the L1 cache memory. This is required to ensure that the execution units are fed all the time with operands in many applications.

However, Cortex-X925 can do at most 4 loads, while it can do 6 FMAs. Because of this lower load throughput Cortex-X925 can reach the maximum FMA throughput only much less frequently than the AMD or Intel CPUs. This is compounded by the fact that achieving better FMA to load ratios requires more storage space in the architectural vector registers, and Cortex-X925 is also disadvantaged for this, by having 4-time smaller vector registers than Zen 5.

by my12313 hours ago|

parent|

[-]

> While the parent article shows AMD Zen 5 having significantly better results in floating-point SPEC CPU2017, these benchmark results are still misleading, because in properly optimized for AVX-512 applications the difference between Zen 5 and Cortex-X925 would be much greater. I have no idea how SPEC has been compiled by the author of the article, but the floating-point results are not consistent with programs optimized for Zen 5.

The arithmetic intensity of most SPECfp subtests is quite low. You see this wall because it ends up reaching bandwidth limitations long before running out of compute on cores with beefy SIMD.

by hajile5 hours ago|

parent|

prev|

[-]

SIMD workloads on CPU tend to be bursty. If your workload is all SIMD with few other instructions or branches, it's almost certainly going to be faster on a GPU or SME co-processor.

If there's space between the SIMD instructions, then double-pumping or even quad-pumping isn't very expensive (and with 6 SIMD ports, it might even be basically free).

by CyberDildonics10 hours ago|

parent|

prev|

[-]

I don't know where the focus on vector instructions comes from. 6 128-bit instructions per clock is not bad at all. 512 bit wide vector instruction being used are exotic.

What most people want is interactivity and fast web pages which doesn't have much to do with wide vector instructions (except possibly for optimized video decoding).

by DeathArrow14 hours ago|

parent|

prev|

[-]

Still, what percentage of software uses AVX512 for its core functionality, so vector performance matters in practice?

by galangalalgol13 hours ago|

parent|

[-]

Auto vectorizing optimizers have gotten quite good. If you are using integers it often just happens whether you think about it or not. With floats unless you specify fast math you will need to use wide types to let it know you don't care about floating point addition order.

by barrkel11 hours ago|

prev|

[-]

In my view, power consumption isn't relevant to a desktop or workstation (and increasingly, desktop machines are workstations since almost everyone uses laptops instead). When I'm plugged into a wall socket, I will take performance over efficiency at every decision point. Power consumption matters to the degree that the resulting heat needs to be dissipated, and if you can't get rid of the heat fast enough, you lose performance.

by rbanffy3 hours ago|

parent|

[-]

There is a whole universe of good-enough desktop computers that doesn't care that much about performance, but where power consumption is important, because it makes the computer bulky, noisy, and expensive.

I'd love to have a Xeon 6, a big EPYC, or an AmpereOne (or a loaded IBM LinuxOne Express) as my daily driver, but that's just not something I can justify. It'd not be easy to come up with something for all this compute capacity to do. A reasonable GPU is a much better match for most of my workloads, which aren't even about pushing pixels anymore - iGPUs are enough these days - but multiplying matrices with embarrassingly low precision, so it can pretend to understand programming tasks.