It's trickier than that. Modern processors are speculative, which means that they guess at the result for a comparison and keep going along one side of a branch as far as they can until they are told they guessed wrong or hit some internal limit. If they guessed wrong, they throw away the speculative work, take a penalty of a handful of cycles, and do the same thing again from a different starting point.

Essentially, this means that all loops are already unrolled from the processor's point of view, minus a tiny bit of overhead for the loop itself that can often be ignored. Since in binary search the main cost is fetching data from memory (or from cache, in the "warm cache" examples), the real game is getting the processor to issue the requests for the data you will eventually need as far in advance as possible, so you don't have to wait as long for it to arrive.
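That "issue the loads early" idea can be sketched in C (a sketch of mine, not from the article; `__builtin_prefetch` assumes GCC or Clang): each iteration hints both candidate midpoints of the *next* iteration, so whichever way the compare goes, the load is already in flight.

```c
#include <stddef.h>

/* Binary lower bound that prefetches both possible next midpoints.
 * Only one of the two hints is used per iteration; the other costs a
 * little bandwidth, which is the trade-off discussed above. */
static size_t lower_bound_prefetch(const int *a, size_t n, int key) {
    size_t lo = 0, hi = n;                  /* search [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        /* Hint the midpoints of both halves before we know which
         * half the comparison will pick. */
        __builtin_prefetch(&a[lo + (mid - lo) / 2]);
        __builtin_prefetch(&a[mid + 1 + (hi - mid - 1) / 2]);
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;                              /* first index with a[i] >= key */
}
```

Prefetch is only a hint, so an out-of-range address on the last iteration is harmless; the point is the dependency structure, not the exact indices.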

The difference with quad search (or anything wider than binary) is that instead of taking one side of each branch (and thus prefetching deeply in one direction), you prefetch all the possible cases, but to a shallower depth. This way you are guaranteed to have issued the prefetch you will eventually need, while spending slightly less of your bandwidth budget on data that will never be used on the actual execution path.
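A minimal sketch of that shape (names are mine; a real implementation would also add explicit prefetch hints): the three pivot loads in each 4-ary step have no dependency on one another, so the CPU can have all three in flight at once, whereas two binary steps covering the same factor-of-4 reduction are serially dependent.

```c
#include <stddef.h>

/* Quaternary lower bound: first index with a[i] >= key.
 * Each round reads three evenly spaced pivots; the reads are
 * independent, so they can be issued in parallel. */
static size_t lower_bound_quad(const int *a, size_t n, int key) {
    size_t lo = 0, hi = n;
    while (hi - lo > 3) {
        size_t q  = (hi - lo) / 4;
        size_t p1 = lo + q, p2 = lo + 2 * q, p3 = lo + 3 * q;
        int v1 = a[p1], v2 = a[p2], v3 = a[p3];  /* independent loads */
        if      (key <= v1) { hi = p1 + 1; }
        else if (key <= v2) { lo = p1 + 1; hi = p2 + 1; }
        else if (key <= v3) { lo = p2 + 1; hi = p3 + 1; }
        else                { lo = p3 + 1; }
    }
    while (lo < hi && a[lo] < key) lo++;     /* finish the tiny tail */
    return lo;
}
```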

As others are pointing out, "number of comparisons" is an almost useless metric when comparing search algorithms if your goal is predicting real-world performance. The limiting factor is almost never the number of comparisons you can do. Instead, the potential for speedup depends on making maximal use of memory and cache bandwidth. So yes, you can view this as loop unrolling, but only if you consider how branching on modern processors works under the hood.

reply
Yeah, I get that the actual comparison instruction itself is insignificant; it's everything that goes along with it. It seems like quaternary is fetching more data, however.

For instance, if you have 8 elements, 01234567, and you're looking for 1, with binary you'd fetch 4, 2, and then 1. With quaternary, you'd fetch 2, 4, 6, then 1. Obviously, if you only have 8 elements you'd just delegate to the SIMD instruction, but if this were a much larger array, you'd be doing more work.
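A toy probe counter makes the extra-fetches point concrete (everything here is illustrative; exact counts depend on how the small tail of the range is finished off). On the 8-element case above, this binary version touches 3 elements where the quaternary version touches 5:

```c
#include <stddef.h>

static int probes;                       /* bumped on every array read */
static int get(const int *a, size_t i) { probes++; return a[i]; }

static size_t bin_find(const int *a, size_t n, int key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int v = get(a, mid);
        if (v == key) return mid;
        if (v < key) lo = mid + 1; else hi = mid;
    }
    return n;                            /* not found */
}

static size_t quad_find(const int *a, size_t n, int key) {
    size_t lo = 0, hi = n;
    while (hi - lo > 3) {
        size_t q  = (hi - lo) / 4;
        size_t p1 = lo + q, p2 = lo + 2 * q, p3 = lo + 3 * q;
        int v1 = get(a, p1), v2 = get(a, p2), v3 = get(a, p3);
        if      (key <  v1) { hi = p1; }
        else if (key == v1) { return p1; }
        else if (key <  v2) { lo = p1 + 1; hi = p2; }
        else if (key == v2) { return p2; }
        else if (key <  v3) { lo = p2 + 1; hi = p3; }
        else if (key == v3) { return p3; }
        else                { lo = p3 + 1; }
    }
    for (size_t i = lo; i < hi; i++)     /* scan the leftover range */
        if (get(a, i) == key) return i;
    return n;
}
```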

I guess on a modern processor, eliminating the data dependency is worth it because the processor's branch prediction and speculation only follows effectively a single path.

Would be interesting to see this at a machine cycle level on a real processor to understand exactly what is happening.

reply
It's not about doing more or less work; it's about doing the work faster. For instance, it's relatively common to discover that recomputing something can be faster than caching it or using a lookup table. Similarly, fetching more from memory can also be faster if it means you make fewer round trips.
reply
Well, that's where I thought this link was going to go before it went down the SIMD path... We already have a way to beat binary search: B-trees. They rest on the same basic insight: take 64 evenly spaced elements from your data set, compare against all of them rapidly, and instead of bifurcating your search space once, you do the equivalent of six bifurcations. Because the 64 elements are stored contiguously in an array, they cost one fetch and you get cache locality... But as you have more elements, you need to repeat this lookup-table step three, four, or five times, which costs a bit of extra space. So what if we make it not cost space by just storing the data itself in these lookup tables...
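The one-level version of that idea can be sketched like this (FANOUT, the names, and the linear tail scan are all mine; a real static B-tree search would compare the whole node with SIMD and binary-search the bucket): gather evenly spaced separators into one small contiguous block, scan that block, then search only the chosen 1/FANOUT-th of the array.

```c
#include <stddef.h>

#define FANOUT 8   /* the comment uses 64; 8 keeps the sketch readable */

static size_t lower_bound_blocked(const int *a, size_t n, int key) {
    int node[FANOUT];                       /* one contiguous "node" */
    size_t step = n / FANOUT;
    if (step == 0) {                        /* tiny input: plain scan */
        size_t i = 0;
        while (i < n && a[i] < key) i++;
        return i;
    }
    for (int i = 0; i < FANOUT; i++)        /* evenly spaced separators */
        node[i] = a[(size_t)(i + 1) * step - 1];
    int b = 0;                              /* scan the node; this is the */
    while (b < FANOUT && node[b] < key)     /* part SIMD would do in one */
        b++;                                /* wide compare */
    size_t lo = (size_t)b * step;
    size_t hi = (b == FANOUT) ? n : (size_t)(b + 1) * step;
    while (lo < hi && a[lo] < key) lo++;    /* finish inside one bucket */
    return lo;                              /* first index with a[i] >= key */
}
```

One block fetch narrows the range by a factor of FANOUT, versus one fetch per halving for binary search, which is exactly the cache-locality argument above.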
reply
A B-tree is not a search algorithm, though; it is a data structure. While it would be nice to be able to somehow instantly materialize a B-tree from a linear array, CPUs aren't quite there yet. It would also be nice not to have to deal with linear arrays in the first place where B-trees would be a better fit, but we are not quite there yet either.
reply
Yes, this can be seen as unrolling the loop a bit. It improves performance not by significantly reducing the number of instructions or memory reads, but by relaxing the dependencies between operations so that they don't have to be executed purely serially. You could also look at it as akin to speculatively executing both sides of the branch.
reply
Quaternary search effectively performs both of the next loop iteration’s possible comparisons simultaneously with the current iteration’s comparison. This is a little more complex than simple loop unrolling.

Regardless, both kinds of search are O(log N) with different constants. The constants don’t matter so much in algorithms class but in the real world they can matter a lot.

reply
Sort of, yes, but you're also removing a data dependency between the unrolled stages.
reply
It is because processors do not do what one might naively think they do.
reply