However, on some processors there's a data-dependent prefetcher that will notice the pointer-like value and start prefetching that address before the CPU requests it.
At this point I'm kinda expecting CPU vendors to stop putting as many Spectre mitigations in the main core, and just have a small crypto core with full-fat arithmetic, less hardware for memory access, less speculation, and careful side-channel hardening. You still have to block Meltdown and other large vulnerabilities on the main cores, but if someone wants to protect elliptic curves from weird attacks? Try to set the DIT bit, trap into the OS, and get sent to the hardened core.
Well, in a language that lets you generate compile-time specialized serde code, like Nim or Zig. Maybe C++ has enough compile-time reflection to do it now as well?
I don't know enough about Rust's serde, but it seems like there'd be a lot of performance overhead given its design and the limits of Rust's macro and compiler system.
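To show what I mean, here's a minimal Nim sketch (the type and the length-prefixed wire format are made up for illustration): `fieldPairs` is unrolled at compile time, so the serializer specializes into straight-line per-field writes with no runtime reflection or dispatch.

```nim
import std/streams

type Point = object
  x, y: int32
  label: string

proc serialize[T: object](s: Stream, v: T) =
  # fieldPairs expands at compile time: one branch per field, and the
  # `when` prunes the type dispatch before code generation.
  for name, field in v.fieldPairs:
    when field is string:
      s.write(int32(field.len))  # hypothetical length-prefixed format
      s.write(field)
    else:
      s.write(field)

var s = newStringStream()
s.serialize(Point(x: 1, y: 2, label: "hi"))
```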
*maddeningly, with mutually incompatible sorting rules.
You can avoid this with two arrays. One contains random query positions, and the target array is also filled with random values. The next index is then a function of the next query position and the previous value read from the target array.
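A rough sketch of that scheme in Nim (the size and the mixing function are arbitrary choices): the address of each load depends on the value returned by the previous load, so a prefetcher has nothing to extrapolate from.

```nim
import std/random

proc chase(n: int): int =
  randomize()
  var order = newSeq[int](n)   # random query positions
  var target = newSeq[int](n)  # random payload values
  for i in 0 ..< n:
    order[i] = rand(n - 1)
    target[i] = rand(n - 1)
  var prev = 0
  for i in 0 ..< n:
    # the next index mixes the query position with the value we just
    # loaded, creating a true load-to-load dependency
    let idx = (order[i] + prev) mod n
    prev = target[idx]
    result += prev

echo chase(1 shl 24)
```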
E.g., start with every element referencing the next element (i.e. i+1 with wrap-around), then relabel the elements with a random shuffle. That way you preserve the full cycle.
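Concretely, a sketch of that construction (Sattolo's algorithm is an equivalent one-pass alternative): relabeling the nodes of the i -> i+1 cycle with a random permutation keeps it one single full cycle, just visited in a random order.

```nim
import std/random

proc randomFullCycle(n: int): seq[int] =
  randomize()
  var p = newSeq[int](n)
  for i in 0 ..< n: p[i] = i
  shuffle(p)                 # random relabeling of the positions
  result = newSeq[int](n)
  for i in 0 ..< n:
    # node p[i] points at node p[(i+1) mod n]: the successor structure
    # is still one big cycle of length n
    result[p[i]] = p[(i + 1) mod n]
```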
Basically, pick a large prime number p as the size of your array and a number 0 < x < p. Then visit your array in the order of (i*x) modulo p.
You can also do something with (x^i) modulo p if your processor is smart enough to figure out the additive pattern of (i*x); for (x^i) to visit every nonzero index, x needs to be a primitive root mod p.
Basically, the idea is to look into the same theory used to produce PRNGs with long cycles.
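A small sketch of both recurrences, using the Fermat prime 65537 (for which 3 is a known primitive root, so the multiplicative sequence hits every nonzero index):

```nim
const P = 65_537  # prime array size
const X = 3       # 3 is a primitive root mod 65537

var data = newSeq[int](P)
var sum = 0
var idxAdd = 0    # (i*X) mod P: a fixed stride a prefetcher may still detect
var idxMul = 1    # (X^i) mod P: no additive pattern to latch onto
for i in 0 ..< P - 1:
  sum += data[idxAdd] + data[idxMul]
  idxAdd = (idxAdd + X) mod P
  idxMul = (idxMul * X) mod P
echo sum
```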
Another thing I noticed is that the spike on the left-hand side of his graphs is the overhead of file access.
Without this overhead, small-array random access should have a much better per-element cost.
Does x86-64 actually do this data-dependent single-deref prefetch? Because in that case I have some design assumptions I have to reevaluate.
CPU implementation has become too complex to grasp. The only sure way to know how a CPU will behave for a given workload is to run the workload. It's good to have some basic expectations of performance (instructions/cycle, memory bandwidth) to detect if something is off. I guess I'm trying to say it's hard to keep in your head all the details of what ~1B transistors are doing together to run your code. It's just too big.
This is a very small Nim program to demonstrate, for "show me the code" and "it must just not be 'random enough'!" skeptics: https://github.com/c-blake/bu/blob/main/memlat.nim It uses the exact dependency idea @andersa mentions, a random cycle built from `x[i] = i`, which others elsewhere in this thread say some CPUs these days are smart enough to "see through". On the Intel CPUs I have, the dependency makes things 12x slower at the gigabyte scale (DIMMs).
EDIT: This same effect makes many a naive hash table microbenchmark { e.g., `for key in keys: lookup(key)` } unrepresentative of performance in real programs where each `key` is often not speculatively pre-computable.
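To make that concrete, a hedged Nim sketch (table size and the key-chaining rule are arbitrary): deriving each key from the value just looked up keeps the keys from being speculatively pre-computable, so the probes serialize the way they do in real pointer-heavy programs.

```nim
import std/tables

const N = 1_000_000
var t = initTable[int, int]()
for i in 0 ..< N:
  t[i] = (i * 2 + 1) mod N   # arbitrary payload that is itself a valid key

var k = 0
var sum = 0
for i in 0 ..< N:
  let v = t[k]   # this lookup must complete before the next key is known
  sum += v
  k = v
echo sum
```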
Traversing a contiguous list of pointers in L1 is also slower than accessing those pointers by generating their addresses sequentially, so adding a load-load dependency is not a good way to benchmark random access vs sequential access (it is a good way to benchmark vector traversal vs list traversal, of course).
At the end of the day you have to accept that, just as caching and prefetching speed up sequential access, OoO execution[1] will speed up (to a lesser extent) random access. Instead of memory latency, the bottleneck in this case is the OoO queue depth, or more likely the maximum number of outstanding L1/L2/L3 (and potentially TLB) misses. As long as the number of misses needed to cover the memory latency of that cache level doesn't exceed this maximum, then, to a first approximation, the CPU can effectively hide the sequential vs random access cost for independent accesses.
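To see both regimes side by side, a sketch (assuming `data` holds a shuffled full cycle as discussed above, and `order` a precomputed list of random indices): in the first loop every address is known up front, so misses overlap up to the outstanding-miss limit; in the second, each miss must complete before the next address even exists.

```nim
proc independentSum(data: seq[int], order: seq[int]): int =
  for idx in order:          # addresses precomputed: misses overlap,
    result += data[idx]      # bounded by the max-outstanding-miss limit

proc dependentSum(data: seq[int], start, n: int): int =
  var idx = start
  for _ in 0 ..< n:          # each loaded value is the next address:
    idx = data[idx]          # misses serialize at full memory latency
    result += idx
```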
Benchmarking is hard. Making sure that a microbenchmark represents your load effectively, doubly so.
[1] Even many in-order CPUs have some run-ahead capabilities for memory.
I still suspect @kortilla is one of today's lucky 10,000 (https://xkcd.com/1053/) or just read/replied too quickly. :-)
There is a lot written that indicates that the complexity of modern CPUs is ill-disseminated. But there is also wonderful stuff like https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll... { To add a couple links to underwrite my reply, in full agreement. :-) }