undefined

points

[-]

In the end it depends exactly what you want to measure. Of course a load-load dependency will make everything as slow as the latency of the cache level you are accessing as that becomes the bottleneck.

Traversing a contiguous list of pointers in L1 is also slower than accessing those pointers by generating their address sequentially, so adding a load-load dependency is not a good way to benchmarking random access vs sequential access (it is a good way to benchmark vector traversal vs list traversal of course).

At the end of the day you have to accept that like caching and prefetching speedup sequential access, OoO execution[1] will speedup (to a lesser extent) random access. Instead of memory latency, in this case the bottleneck would be the OoO queue depth, or more likely the maximum number of outstanding L1/L2/L3 (and potentially TLB) misses. As long as the maximum number of outstanding misses is lower than the memory latency for that cache level, then, in first approximation, the cpu can effectively hide the sequential vs random access cost for independent accesses.

Benchmarking is hard. Making sure that that a microbenchmark represents your load effectively, doubly so.

[1] Even many in-order CPUs have some run-ahead capabilities for memory.

by cb3218 hours ago|

parent|

[-]

All true. No real disagreement and I'm often saying "it all depends.." myself. :-) In this case, there is also some vagueness around "random" (predictable to what subsystem when).

I still suspect @kortilla is one of today's lucky 10,000 (https://xkcd.com/1053/) or just read/replied too quickly. :-)

There is a lot written that indicates that the complexity of modern CPUs is ill-disseminated. But there is also wonderful stuff like https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll... { To add a couple links to underwrite my reply in full agreeance. :-) }