The reason I chose "time per element" then follows from that different goal, because I was comparing across vastly different array sizes, so no other metric I could think of would have really worked for the charts I was drawing.
Using something like the overall run length would have such large variations making only the shape of the graph particularly useful (to me) less so much the values themselves.
If I was showing a chart like this to "leadership" I'd show with the overall run length. As I'd care more about them realizing the "real world" impact rather than the per unit impact. But this is written for engineers, so I'd expect it to also be focused on per unit impacts for a blog like this.
However, having said all that, I'd love to hear what your reservations are using it as a metric.
Perhaps it’s a long time inspiration from this post: https://randomascii.wordpress.com/2018/02/04/what-we-talk-ab...
I also just don’t know what to do with “1 ns per element”. The scale of 1 to 4 ns per element is remarkably imprecise. Discussing 1 to 250 million to 1 billion elements per second feels like a much wider range. Even if it’s mathematically identical.
Your graphs have a few odd spikes that weren’t deeply discussed. If it’s under 2ns per element who cares!
The logarithmic scale also made it really hard to interpret. Should have drawn clearer lines at L1/L2/L3/ram limits.
On skim I don’t think there’s anything wrong. But as presented it’s a little hard for me as an engineer to extract lessons or use this information for good (or evil).
There shouldn’t be a Linux vs Mac issue. Ignoring mmap this should be HW.
I dunno. Those are all just surface level reactions.
Agreed that the odd spikes don't matter, that's why I didn't bother discussing them; I was more interested in the data after the array got large enough that random access was actually slower. It looked like all those weird spikes were for arrays small enough to fit in cache anyways.
I agree that it could have been helpful if I'd drawn lines at L1/L2/L3/RAM limits, but I didn't do that because I don't think it's entirely clear where those lines should have been drawn. Specifically because there are two arrays. Should the line show just where the floating-point array is small enough to fit in cache, or where both arrays together are?
Not sure I quite follow what you're saying about mmap on Linux vs Mac; only one of the three sets of experiments used mmap, and the third was explicitly to try to tease out that effect. Especially for the first experiment, I agree that there should be no difference for arrays small enough to fit in RAM, since the whole file gets read into memory first.
> Random access from the cache is remarkably quick. It's comparable to sequential RAM performance
That's actually expected once you think about it, it's a natural consequence of prefetching.
Lots of things are expected when you deeply understand a complex system and think about it. But, like, not everyone knows the system that deeply nor have they thought about it!