What are your qualms with time per element? I liked it as a metric because it kept the total deviation of results to less than 32 across the entire result set.

Using something like the overall run length would have such large variations that only the shape of the graph would be particularly useful (to me), not so much the values themselves.

If I were showing a chart like this to "leadership" I'd show the overall run length, since I'd care more about them grasping the "real world" impact than the per-unit impact. But this is written for engineers, so for a blog like this I'd expect it to focus on per-unit impacts.

However, having said all that, I'd love to hear what your reservations are using it as a metric.

reply
It’s not wrong per se. I’m just very wary of nano-scale benchmarks. And I think in general you should advertise “velocity” not “time per”.

Perhaps it’s long-time inspiration from this post: https://randomascii.wordpress.com/2018/02/04/what-we-talk-ab...

I also just don’t know what to do with “1 ns per element”. A scale of 1 to 4 ns per element reads as remarkably imprecise. Discussing 250 million to 1 billion elements per second feels like a much wider range, even though it’s mathematically identical.
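The two framings really are just reciprocals of each other; a quick sketch of the conversion (the 1 ns and 4 ns endpoints are the ones mentioned above):

```python
# Convert "time per element" (ns) into "throughput" (elements per second).
# 1 ns/element and 4 ns/element are the endpoints discussed above.

def ns_per_element_to_throughput(ns: float) -> float:
    """Elements per second for a given time per element in nanoseconds."""
    return 1e9 / ns

print(ns_per_element_to_throughput(1.0))  # 1e9   -> 1 billion elements/s
print(ns_per_element_to_throughput(4.0))  # 2.5e8 -> 250 million elements/s
```

Same data either way; the throughput framing just spreads a 4x difference over a visually wider numeric range.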

Your graphs have a few odd spikes that weren’t deeply discussed. If it’s under 2 ns per element, who cares!

The logarithmic scale also made it really hard to interpret. It would have helped to draw clearer lines at the L1/L2/L3/RAM limits.

On skim I don’t think there’s anything wrong. But as presented it’s a little hard for me as an engineer to extract lessons or use this information for good (or evil).

There shouldn’t be a Linux vs Mac issue. Ignoring mmap, this should be purely hardware.

I dunno. Those are all just surface level reactions.

reply
Haha, it seems you may have thought the person you were responding to is the post author :) but actually that would be me.

Agreed that the odd spikes don't matter, that's why I didn't bother discussing them; I was more interested in the data after the array got large enough that random access was actually slower. It looked like all those weird spikes were for arrays small enough to fit in cache anyways.

I agree that it could have been helpful if I'd drawn lines at L1/L2/L3/RAM limits, but I didn't do that because I don't think it's entirely clear where those lines should have been drawn. Specifically because there are two arrays. Should the line show just where the floating-point array is small enough to fit in cache, or where both arrays together are?

Not sure I quite follow what you're saying about mmap on Linux vs Mac; only one of the three sets of experiments used mmap, and the third was explicitly to try to tease out that effect. Especially for the first experiment, I agree that there should be no difference for arrays small enough to fit in RAM, since the whole file gets read into memory first.

reply
> Agreed that the odd spikes don't matter, that's why I didn't bother discussing them

That was sarcasm =P Those spikes are very curious, and the choice of presentation makes them seem like noise, but there is something there that should be investigated further imho. In the graph it looks like noise. I mean, it’s just 1 ns. But a 2x throughput difference isn’t noise! That’s huge! Very curious.

> Not sure I quite follow what you're saying about mmap on Linux vs Mac

Your 4th conclusion is “On Linux, random order starts getting even slower for arrays over a gigabyte, becoming more than 50x slower than first-to-last order; in contrast, random order on the MacBook seems to just level out as long as everything fits in RAM.”. That doesn’t make sense. There shouldn’t be any OS difference here.

reply
Gotcha, sorry for not picking up on the sarcasm. Yeah, I mean, I didn't really bother running the experiments many times for the smaller array sizes, so it could potentially be interesting to see if those artifacts persist when poked.

Could you clarify why there shouldn't be an OS difference? I was under the impression that it's the OS that handles how swap space is implemented (which was used by the first set of experiments), as well as how memory-mapped files are implemented (which was used by the second set of experiments). Am I mistaken about that?

reply
> Could you clarify why there shouldn't be an OS difference?

Ignoring mmap. But on a 16 GB system, why would performance degrade at 1 GB? You shouldn’t be hitting swap. So any differences should be hardware-architecture related: M1 unified memory vs Ryzen. And I wouldn’t expect 1 GB to be a magic threshold.

I would definitely expect a threshold beyond 16 GB, and I’d expect swap to come into play at maybe 12 GB. I wouldn’t expect a huge difference between 500 MB and 8 GB. OK, there’s a probability difference of hitting the L3 cache, but most of those random accesses will be hitting system RAM, so it should be about the same.
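That probability argument can be roughed out. A back-of-the-envelope sketch, assuming uniform random accesses and a hypothetical 32 MB L3 (the real size depends on the chip, and this ignores the index array, TLB effects, and replacement policy):

```python
# Rough expected L3 hit rate for uniform random accesses into an array.
# The 32 MB L3 size is an illustrative assumption, not a measured value.

L3_BYTES = 32 * 2**20

def l3_hit_fraction(array_bytes: int, l3_bytes: int = L3_BYTES) -> float:
    """Fraction of random accesses expected to land on cache-resident data."""
    return min(1.0, l3_bytes / array_bytes)

for mb in (500, 8 * 1024):  # 500 MB vs 8 GB, as in the comment above
    frac = l3_hit_fraction(mb * 2**20)
    print(f"{mb} MB array: ~{frac:.1%} of random accesses can hit L3")
```

At 500 MB only a few percent of accesses can hit L3, and at 8 GB well under one percent, so in both cases almost every random access goes to RAM, which is why a huge gap between those sizes would be surprising.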

Could be wrong! But that’s what I’d expect.

reply
Ohh you're right thanks, I mixed up which part of the post you were talking about; sorry for getting confused.

Yes, I agree that it doesn't make sense for that to be due to the OS. That's just poor writing on my part: in that case I wasn't actually trying to imply that it was due to the operating system specifically, I was just using "Linux" and "the MacBook" as shorthand to refer to my two different computers, which differ in more ways than just the OS. In this case, another commenter suggested it might be due to my physical setup of having three RAM sticks: https://news.ycombinator.com/item?id=44397214

So yes, in that sentence I should have written "on my desktop" instead of just "on Linux".

reply
Could it be a TLB issue? Page size on Mac is 4x larger than on Intel.
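For scale, a rough "TLB reach" calculation: the entry count below is an assumption (real TLB sizes vary by microarchitecture), but it shows why 4x larger pages let the same number of TLB entries cover 4x more memory:

```python
# Rough "TLB reach": how much memory the TLB can map without a miss.
# The 2048-entry count is an illustrative guess, not a measured value.

def tlb_reach_bytes(entries: int, page_bytes: int) -> int:
    return entries * page_bytes

x86_reach = tlb_reach_bytes(2048, 4 * 1024)     # 4 KB pages (x86)
apple_reach = tlb_reach_bytes(2048, 16 * 1024)  # 16 KB pages (Apple Silicon)

print(f"4 KB pages:  {x86_reach / 2**20:.0f} MB reach")
print(f"16 KB pages: {apple_reach / 2**20:.0f} MB reach")
```

Either way the reach is tens of megabytes at most, so random accesses into a multi-gigabyte array would miss the TLB almost every time on both machines, with the smaller pages missing at a higher rate per byte touched.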
reply
Great post, thanks for the link! I think you and I were just focusing on different things. You gave a broader discussion of the topic from a few different angles, with more specific basic numbers about CPUs as well as more realistic benchmarks than what I have here. I just wanted to focus on the simplest example I could think of, and run it on as wide a range of different array sizes as I could.

The reason I chose "time per element" then follows from that different goal, because I was comparing across vastly different array sizes, so no other metric I could think of would have really worked for the charts I was drawing.

reply
From your blog post:

> Random access from the cache is remarkably quick. It's comparable to sequential RAM performance

That's actually expected once you think about it; it's a natural consequence of prefetching.
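The kind of comparison being discussed can be sketched even in plain Python, though interpreter overhead dominates there, so the gap between access orders is far smaller than in compiled code; this just shows the shape of the test:

```python
# Sketch: sum the same data in sequential vs shuffled order.
# In pure Python the interpreter overhead swamps most of the cache
# effect, so treat any timing gap here as qualitative only.
import random
import time

N = 2_000_000
data = list(range(N))
seq_idx = list(range(N))
rnd_idx = seq_idx.copy()
random.shuffle(rnd_idx)

def timed_sum(indices):
    start = time.perf_counter()
    total = sum(data[i] for i in indices)
    return total, time.perf_counter() - start

seq_total, seq_t = timed_sum(seq_idx)
rnd_total, rnd_t = timed_sum(rnd_idx)
assert seq_total == rnd_total  # same elements, different order
print(f"sequential: {seq_t:.3f}s  shuffled: {rnd_t:.3f}s")
```

A compiled version of the same loop is where prefetching really shows up: the hardware prefetcher can stream the sequential case from RAM, while the shuffled case pays a full memory latency per element once the array outgrows cache.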

reply
Heh. That line often gets called out.

Lots of things are expected when you deeply understand a complex system and think about it. But, like, not everyone knows the system that deeply nor have they thought about it!

reply
If that weren't the case, the machine would have to prefetch into the register file, and I don't know of any CPU that does that.
reply
What's most misleading is the data for the smaller sizes (1k).
reply