https://github.com/protocolbuffers/protobuf/blob/44025909eb7...
1. Check for dense list O(1)
2. Check upper bound
3. Constant trip count binary search
The constant trip count is great for the branch predictor, and the core loop is pretty tightly optimized for the target hardware, avoiding multiplies. Every attempt to get more clever made the loop worse and did not pay for itself. It's hard because it's an array-of-structs format with a size of 12, and mostly pretty small N.
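For anyone curious what the constant trip count buys: in a loop like the sketch below, the iteration count depends only on the search length, never on the key, so the only data-dependent work is a conditional select rather than an unpredictable branch. This is a generic branchless lower-bound sketch, not the actual upb_MiniTable_LowerBound; the struct and function names are made up, and like the real code it assumes the caller has already rejected keys above the last entry.

#include <stddef.h>
#include <stdint.h>

/* Made-up 12-byte array-of-structs entry, standing in for upb_MiniTableField. */
struct entry {
  uint32_t number;
  uint8_t payload[8];
};

/* Branchless lower bound over a sorted array of entries.  The trip count is a
 * function of len alone, and the body compiles to a conditional move/select,
 * so the branch predictor never has to guess based on the data.  Assumes the
 * caller already checked that number is not past the last entry. */
static const struct entry* lower_bound(const struct entry* base, size_t len,
                                       uint32_t number) {
  while (len > 1) {
    size_t half = len / 2;
    /* Step over the left half iff its last element is still too small. */
    base = (base[half - 1].number < number) ? base + half : base;
    len -= half;
  }
  return base;
}

The linked upb source layers the dense fast path and the range check from the list above on top of a loop along these lines, plus an ARM64-specific variant.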
I know people train themselves into grokking this style and reading and writing it fluently, but it sounds like writing "bork bork bork bork" runes to me.
I'm glad Rust feels more like Ruby and Python and that method and field names are legible.
My eyes just glaze over:
UPB_API_INLINE
const struct upb_MiniTableField* upb_MiniTable_FindFieldByNumber(
    const struct upb_MiniTable* m, uint32_t number) {
  const uint32_t i = number - 1;  // 0 wraps to UINT32_MAX
  // Ideal case: index into dense fields
  if (i < m->UPB_PRIVATE(dense_below)) {
    UPB_ASSERT(m->UPB_ONLYBITS(fields)[i].UPB_ONLYBITS(number) == number);
    return &m->UPB_ONLYBITS(fields)[i];
  }
  // Early exit if the field number is out of range.
  uint32_t hi = m->UPB_ONLYBITS(field_count);
  uint32_t lo = m->UPB_PRIVATE(dense_below);
  UPB_ASSERT(hi >= lo);
  uint32_t search_len = hi - lo;
  if (search_len == 0 ||
      number > m->UPB_ONLYBITS(fields)[hi - 1].UPB_ONLYBITS(number)) {
    return NULL;
  }
  // Slow case: binary search
  const struct upb_MiniTableField* candidate;
#ifndef NDEBUG
  candidate = UPB_PRIVATE(upb_MiniTable_ArmOptimizedLowerBound)(
      m, lo, search_len, number);
  UPB_ASSERT(candidate ==
             UPB_PRIVATE(upb_MiniTable_LowerBound)(m, lo, search_len, number));
#elif UPB_ARM64_ASM
  candidate = UPB_PRIVATE(upb_MiniTable_ArmOptimizedLowerBound)(
      m, lo, search_len, number);
#else
  candidate = UPB_PRIVATE(upb_MiniTable_LowerBound)(m, lo, search_len, number);
#endif
  return candidate->UPB_ONLYBITS(number) == number ? candidate : NULL;
}

Honestly I don't see much difference between
upb_MiniTable_FindFieldByNumber
and upb::MiniTable::FindFieldByNumber

npm -g i package-name

Like why would you teach people to do this? I understand people needed to save precious bytes in the sixties, so we have cat and ls, but saving 192 bytes or whatever with shorter variable names is not a worthwhile tradeoff anymore.

I did not bookmark it, and about twice a year I go searching for it again. Some say he’s still searching to this day.
That's why b-trees are the standard in databases. The data could be anything, and its characteristics could massively change at any time, as you suddenly import a whole bunch of new rows at once.
And while you can certainly design algorithms around e.g. gradient descent to try to accelerate lookup, b-trees are already incredibly fast, and have lots of other benefits like predictable worst-case performance and I/O requirements, supporting range scans, ordered traversal, prefix conditions, etc.
So yes, you can certainly design lookup algorithms that are more efficient for particular data distributions, but they will also often lack other important properties. And b-trees are already so fast that improvements are often negligible -- like even if another algorithm produces a closer initial guess, it may be slower to locate the final item, or it may be faster on average but have horrible worst-case performance that makes it unusable.
Even with a paper dictionary, I've always used pretty much a binary search beyond the initial guess, which only saves you a couple of hops. And actually, once I get to the right handful of pages I'm probably more linear than I should be, and I'd probably be faster if I tried to do a rigorous binary search, but I have to balance that with how long it takes to flip pages.
Also, don't you learn something about the data while performing the binary search? Like, if you are constantly below the estimate, you could guess that the distribution is biased toward large values and adjust your guess based on this prediction.
Yes, absolutely!
I forgot to share this general perspective above, and it's too late to edit, so I'll add it here...
Since binary search assumes only monotonicity, splitting your interval into two equal parts extracts one bit of information per step, and any other choice would extract less information on average. One bit of information per step is how you end up needing log(n) steps to find the answer.
To accelerate your search, you basically need to extract those log(n) bits as fast as you can. You can think of that as leveraging both the prior, and everything you learn along the way -- to adaptively design each step to be the optimal experiment to extract maximum amount of information. And adaptive local models of your search space (gradient / hessian / etc) allow you to extract many more bits of information from each query / experiment, provided the function you are inverting has some local structure.
PS: That is why we leverage these ideas to "search" for the optimum, among a space of solutions.
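To make the "local model" point concrete, here is a toy sketch (nothing to do with upb; the function and numbers are made up) comparing plain bisection, which extracts one bit per evaluation, with a secant-style step that fits a linear model to the last two evaluations and typically needs far fewer queries when the function being inverted is smooth:

#include <stdio.h>

/* Toy monotone function to invert; smooth, so a local linear model is informative. */
static double f(double x) { return x + 0.1 * x * x; }

/* Bisection: each evaluation of f yields exactly one bit about where the answer is. */
static double invert_bisect(double target, double lo, double hi, int evals) {
  for (int i = 0; i < evals; i++) {
    double mid = 0.5 * (lo + hi);
    if (f(mid) < target) lo = mid; else hi = mid;
  }
  return 0.5 * (lo + hi);
}

/* Secant: a linear local model built from the last two evaluations lets each
 * step jump much closer, extracting many more "bits" per query on smooth f. */
static double invert_secant(double target, double x0, double x1, int steps) {
  double f0 = f(x0) - target, f1 = f(x1) - target;
  for (int i = 0; i < steps; i++) {
    double x2 = x1 - f1 * (x1 - x0) / (f1 - f0);
    x0 = x1; f0 = f1;
    x1 = x2; f1 = f(x1) - target;
  }
  return x1;
}

int main(void) {
  /* Both solve f(x) = 12 on [0, 10]; the exact answer is about 7.0416. */
  printf("bisection, 20 evals: %.6f\n", invert_bisect(12.0, 0.0, 10.0, 20));
  printf("secant,     6 steps: %.6f\n", invert_secant(12.0, 0.0, 10.0, 6));
  return 0;
}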
If we guessed that there was a bias in the distribution based on recently seen elements, the guess would be at least as likely to be wrong as to be right. And if we guess incorrectly, in the worst case the algorithm degrades to a linear scan.
Unless we have prior knowledge. For example: if there is a particular distribution, or if we know we're dealing with integers without any repetition (i.e. each element is strictly greater than the previous one), etc.
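For the "no repetition" case specifically, that prior can be turned into hard bounds before the search even starts. A rough sketch (the helper name and signature are made up): in a sorted array of distinct integers, consecutive elements differ by at least 1, so the endpoint values alone pin down how far the target can possibly sit from either end.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the index of t in a sorted array of distinct
 * int64 values, or -1 if absent.  Because a[j] - a[i] >= j - i, the values at
 * the ends bound the target's possible index before any probes are made. */
static ptrdiff_t find_distinct_sorted(const int64_t* a, size_t n, int64_t t) {
  if (n == 0 || t < a[0] || t > a[n - 1]) return -1;
  size_t lo = 0, hi = n - 1;

  /* index(t) <= lo + (t - a[lo]) and index(t) >= hi - (a[hi] - t). */
  int64_t span = (int64_t)(hi - lo);
  int64_t d_lo = t - a[lo];   /* >= 0 after the range check above */
  int64_t d_hi = a[hi] - t;   /* >= 0 */
  size_t new_hi = d_lo < span ? lo + (size_t)d_lo : hi;
  size_t new_lo = d_hi < span ? hi - (size_t)d_hi : lo;
  lo = new_lo;
  hi = new_hi;

  /* Ordinary binary search over the (often much smaller) remaining range. */
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (a[mid] < t) lo = mid + 1; else hi = mid;
  }
  return a[lo] == t ? (ptrdiff_t)lo : -1;
}

On dense data (values close to their indexes) the tightened interval is tiny; on sparse data it simply falls back to the full-range binary search.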
This is true for abstract and random data. I don't think it's true for real world data.
For example, python's sort function "knows nothing" about the data you're passing in. But, it does look for some shortcuts and these end up saving time, on average.
You have another piece of information: you don't only know whether the element was before or after the compared element. You also know the delta between what you looked at and what you're looking for, and you have the delta from the previous item you looked at.
Actually deciding what to do with that information without incurring a bunch more cache misses in the process may be tricky.
Depending on the problem space such assumptions can prove right enough to be worth using despite sometimes being wrong. Of course if you've got the compute to throw at it (and the problem is large) take the Contact approach: why do one when you can do two in parallel for twice the price (cycles)?
Thinking about it--yeah, if you can anticipate anything like a random distribution it's a few extra instructions to reduce the number of values looked up. In the old days that would have been very unlikely to be a good deal, but with so many algorithms dominated by the cache (I've seen more than one case where a clearly less efficient algorithm that reduced memory reads turned out better) I suspect there's a lot of such things that don't go the way we learned them in the stone age.
You can do better if the list is stable by reusing information.
But gathering that information during searches is going to require great complexity to leverage, as searches are an irregular information gathering scheme.
So spend some RAM up front on speedup structures.
1) Create a table that maps the first 8 bits to upper and lower indexes in the list, then binary search over the last 8 bits (see the sketch after this list). That roughly halves the search time.
2) Go all the way, and create an array of 32,768 indexes, with all 1's for misses. Either way, search returns O(1).
Stable lists allow a sliding parametric trade-off between RAM lookup and binary search, from full lookup to full binary search.
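A rough sketch of option 1, assuming 16-bit keys stored in ascending order (the names here are made up, not from any library): precompute, for each possible high byte, the range of indexes whose keys start with that byte, then binary search only within that bucket, i.e. over the low 8 bits.

#include <stddef.h>
#include <stdint.h>

struct bucket { uint32_t lo, hi; };  /* half-open index range [lo, hi) */

/* Build the 256-entry table once, while the (stable) key list is loaded. */
static void build_buckets(const uint16_t* keys, size_t n, struct bucket out[256]) {
  size_t i = 0;
  for (unsigned b = 0; b < 256; b++) {
    out[b].lo = (uint32_t)i;
    while (i < n && (keys[i] >> 8) == b) i++;
    out[b].hi = (uint32_t)i;
  }
}

/* Returns the index of key, or -1 if it is not present. */
static ptrdiff_t lookup(const uint16_t* keys, const struct bucket buckets[256],
                        uint16_t key) {
  struct bucket b = buckets[key >> 8];
  uint32_t lo = b.lo, hi = b.hi;  /* at most 256 candidates left */
  while (lo < hi) {
    uint32_t mid = lo + (hi - lo) / 2;
    if (keys[mid] < key) lo = mid + 1; else hi = mid;
  }
  return (lo < b.hi && keys[lo] == key) ? (ptrdiff_t)lo : -1;
}

Option 2 is the degenerate case where the table is indexed by the whole key and stores the index directly, with an all-ones sentinel for a miss.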
It is both obvious and profound: the more information you already have, the more information you already have.
Never thought about it this way. Brilliant!
IOW your prior on the data distribution lets you skip the first 4-5 binary chops.
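(For scale: skipping the first 5 chops on a sorted array of a million elements means the search starts from an interval of roughly 10^6 / 2^5 ≈ 31,000 instead of the full million, i.e. the prior supplied about 5 of the ~20 bits needed.)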
Do you mean using a better estimator for the median value? Or something else?
You don't even need priors. See interpolation search, where knowing the position and value of two elements in a sorted list already allows the search to make an educated guess about where the element it's searching for is, by interpolating between those elements.
That's a prior about the distribution, if a relatively weak one (in some sense, at least).
It's an interpolation search. You interpolate the values you evaluated by whatever method you'd like. No one forces you to do linear interpolation. You can very easily fit a quadratic polynomial with the last 3 points, for example.
Interpolation search seems to have a convergence rate of log log n. That's pretty efficient.
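For anyone who hasn't seen it, a minimal interpolation-search sketch over a sorted array of integers (the helper is hypothetical, and this is the plain linear-interpolation variant; as noted above you could fit a quadratic through recent probes instead):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: interpolation search over a sorted int64 array.
 * Instead of probing the midpoint, probe where the target "should" be if the
 * values were spread evenly between the current endpoints.  On roughly
 * uniform data this takes O(log log n) probes on average; on adversarial
 * data it can degrade toward linear, which is the usual caveat. */
static ptrdiff_t interpolation_search(const int64_t* a, size_t n, int64_t t) {
  if (n == 0) return -1;
  size_t lo = 0, hi = n - 1;
  while (a[lo] <= t && t <= a[hi]) {
    if (a[lo] == a[hi]) break;  /* single value left; avoid division by zero */
    /* Linear interpolation between the endpoint values picks the next probe. */
    double frac = (double)(t - a[lo]) / (double)(a[hi] - a[lo]);
    size_t pos = lo + (size_t)(frac * (double)(hi - lo));
    if (a[pos] == t) return (ptrdiff_t)pos;
    if (a[pos] < t) lo = pos + 1;
    else hi = pos - 1;  /* pos > lo here, since a[lo] <= t < a[pos] */
  }
  return (a[lo] == t) ? (ptrdiff_t)lo : -1;
}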