Covering the full range of things hash functions are commonly used for requires at least four different algorithms, if you care about performance and optimality. It is disconcertingly common to see developers using hash algorithms in contexts where they are not fit for purpose. Gotta pick the right tool for the job.
For example, when you know that your input is uniformly randomly distributed, then truncation is a perfectly good hash function. (And a special case of truncation is the identity function.)
The above condition might sound too strong to be practical, but it is satisfied when you are, e.g., dealing with UUIDs.
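As a sketch of that idea (the function name is mine): a random UUID's bytes are already nearly uniformly distributed, so simply truncating to the first 8 bytes gives a well-distributed 64-bit hash with no mixing step at all.

```python
import uuid

def truncation_hash(u: uuid.UUID) -> int:
    # A version-4 UUID is (nearly) uniformly random, so its first
    # 8 bytes are already a well-distributed 64-bit hash value.
    return int.from_bytes(u.bytes[:8], "big")

h = truncation_hash(uuid.uuid4())
```

(Strictly speaking a few bits of a UUID4 are fixed by the version/variant fields, but for bucketing purposes the distribution is still excellent.)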
Another interesting hash function: length. See https://news.ycombinator.com/item?id=6919216 for a bad example. For a good example: consider rmlint and other file system deduplicators.
These deduplicators scan your filesystem for duplicates (amongst other things). You don't want to compare every file against every other file. So as a first optimisation, you compare files only by some hash. But conventional hashes like sha256 or crc take O(n) to compute. So you compute cheaper hashes first, even if they are weaker. Truncation, i.e. only looking at the first few bytes, is very cheap. Determining the length of a file is even cheaper.
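A minimal sketch of that tiered scheme (the function name and tier sizes are mine, not rmlint's): group by length first, then by a small prefix read, and only pay for a full-file hash on files that still collide.

```python
import hashlib
import os
from collections import defaultdict

def find_candidate_duplicates(paths):
    # Tier 1: file length -- just a stat() call, no reads at all.
    by_length = defaultdict(list)
    for p in paths:
        by_length[os.path.getsize(p)].append(p)

    # Tier 2: first 4 KiB -- one cheap read instead of a full scan.
    by_prefix = defaultdict(list)
    for group in by_length.values():
        if len(group) < 2:
            continue  # unique length, cannot have a duplicate
        for p in group:
            with open(p, "rb") as f:
                by_prefix[(os.path.getsize(p), f.read(4096))].append(p)

    # Tier 3: full sha256 -- the expensive O(n) hash, paid only by
    # files that survived both cheap filters.
    by_digest = defaultdict(list)
    for group in by_prefix.values():
        if len(group) < 2:
            continue
        for p in group:
            with open(p, "rb") as f:
                by_digest[hashlib.sha256(f.read()).hexdigest()].append(p)

    return [g for g in by_digest.values() if len(g) > 1]
```

Real tools add more tiers (and compare the surviving candidates byte-for-byte at the end), but the escalation from cheap to expensive hashes is the core idea.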
> Those use a rather expensive hash function (you really want to avoid hash collisions), [...]
Then we are clearly not thinking of the same kind of software.
> but (at least some ten years ago) memory, not processing speed, was the limiting factor.
In what I described, IO is the limiting factor. You want to avoid having to read the whole file, if you can.
I think you are thinking of block level online deduplicators that are integrated into the file system?
Ah, right. I now dimly recall some old project realizing fs-snapshots using hard links, which one could consider some sort of deduplication as well.
> I think you are thinking of block level online deduplicators that are integrated into the file system?
Indeed, I was.
def djb2(data):
    h = 5381
    for byte in data:
        h = (h * 33 + byte) & 0xFFFFFFFF  # keep the hash at 32 bits
    return h
PS of course if you think the multiplication is overkill, consider that it is nothing more than a shift and an addition: x * 33 == (x << 5) + x.

Nevertheless, the purposes of hash functions and PRNGs are different and complementary.
A PRNG receives a short fixed-length value (the seed) and it expands it into a long pseudo-random sequence of arbitrary length.
A hash function receives a long input sequence of arbitrary length and it generates a short fixed-length pseudo-random value.
Good PRNGs are injective functions and good hash functions are surjective functions.
Normally the design methods for PRNGs and for hash functions should be presented together, because it is easy to interconvert algorithms for one of them with algorithms for the other. For instance, given a good hash function one could make a PRNG by computing the hashes of a sequence of numbers or the hashes of a sequence of strings of increasing length, and given a good PRNG one could make a hash function by accumulating somehow the input into a PRNG seed and taking the first generated number, or better by using input chunks as seeds and then accumulating the first generated numbers into a single value.
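A sketch of the first direction described above (hashing a sequence of numbers to get a PRNG), using Python's hashlib; the class name and interface are mine:

```python
import hashlib

class HashCounterPRNG:
    """A PRNG built from a hash function: each output is the hash
    of (seed, counter), with the counter incremented per call."""

    def __init__(self, seed: bytes):
        self.seed = seed
        self.counter = 0

    def next_u64(self) -> int:
        msg = self.seed + self.counter.to_bytes(8, "little")
        self.counter += 1
        digest = hashlib.sha256(msg).digest()
        # Truncate the 256-bit digest to a 64-bit output.
        return int.from_bytes(digest[:8], "little")
```

A production PRNG would use a much cheaper mixing function than sha256, but the counter-mode shape is exactly this.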
However, for a successful conversion between PRNG and hash function algorithms, the source algorithm may have to be overdesigned, to guarantee good enough properties even after the conversion.
When an algorithm is designed directly as a hash function or as a PRNG, with clearly specified requirements, it can be designed only as good as strictly necessary, enabling thus a better performance.
Hashes are _functions_, so they provide the same output given the same input.
If you don't reseed the RNG after every hash computation, then you break this vital property of hashes.
And if you do reseed, then your claim boils down to "every hash function is just an XOR against a constant", which certainly is not true either.
That might be one very particular way to write a hash function, but it's far from the only one.
Believe it or not, for some purposes taking the first few bytes of a string or even just the length of the string are good hash functions.
What are those purposes?
Another use:
Suppose you write a tool like rmlint that is looking for duplicate files. Generally, you compute some hash for each file, see if you got any duplicates, and then compare the relevant files directly.
A traditional hash like crc or sha256 takes O(n) to compute. But for files you can start with some cheaper hashes, like file length. After all, files of different length can't have the same content. Taking the first few bytes of your file is another cheap 'hash' you can compute.
Only when these cheap 'hashes' show that you have a potential duplicate, do you go and pay for a more expensive hash.
For instance, if the last operation during computing a hash function was the kind of integer multiplication where the result has the same length as the operands, then the top bits are the best (because they depend on more of the input bits than the bottom result bits).
On the other hand, if the last operation during computing a hash function was the kind of integer multiplication where the result has a length that is the sum of the lengths of the operands (i.e. where the high bits of the result are not discarded), then the best bits of the result are neither the top bits nor the bottom bits, but the middle bits, for the same reason as before: they are the bits that depend on more of the input bits than either the bottom bits or the top bits of the result. (This fact becomes obvious if you do a long multiplication with pen and paper. After you compute the partial products and start to add them, you can see that computing either the top digits or the bottom digits requires adding far fewer digits from the partial products than computing the middle digits, so the middle digits are more thoroughly mixed as functions of the input digits.)
When the last operation is an addition, the top bits are the best, because the carry bits propagate from the bottom bits towards the top bits: the higher a bit is in the result, the more input bits it depends on.
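A sketch of both multiplication cases (the constant and function names are mine, not from any particular hash): a same-width multiply where you keep the top bits, and a full-width multiply where you keep the middle bits.

```python
MASK64 = (1 << 64) - 1
GOLDEN = 0x9E3779B97F4A7C15  # a commonly used odd 64-bit multiplier

def top_bits_hash(x: int, bits: int) -> int:
    # Same-width multiply: the product is truncated to 64 bits,
    # so the TOP bits of the result have absorbed the most input bits.
    return ((x * GOLDEN) & MASK64) >> (64 - bits)

def middle_bits_hash(x: int, bits: int) -> int:
    # Full-width multiply: Python keeps all 128 product bits, so the
    # best-mixed bits sit in the MIDDLE of the product, around bit 64.
    product = (x & MASK64) * GOLDEN  # up to 128 bits
    return (product >> (64 - bits // 2)) & ((1 << bits) - 1)
```

Note how well `top_bits_hash` spreads even consecutive integers across a 16-bit table, which is exactly why multiply-then-take-top-bits ("Fibonacci hashing") is a popular way to derive table indices.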
And since most languages require each data type to provide its own hash function, you kind of have to assume that the hash is half-assed and bottom bits are better. I think only Rust could make decisions differently here, since it's parametric over hashers, but I haven't seen that done.
Nowadays swisstable and other similar hashtables use the top bits and simd/swar techniques to quickly filter out collisions after determining the starting bucket.
When other operations are used, there may be other bits with higher entropy. For example, when full-length multiplication is used (i.e. where the length of the result is the sum of the lengths of the operands), the middle bits have the highest entropy, not the top bits. On CPUs like the Intel/AMD x86-64 CPUs, where fast long multiplication instructions are available, this can be exploited in more performant hash functions and PRNGs.
In hash functions, additions and multiplications are frequently used together with rotations, in order to redistribute the entropy from the top bits to the bottom bits, during the following operations.
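For instance, a rotate-then-multiply mixing round might look like this (the constants are illustrative, not taken from any particular hash):

```python
MASK64 = (1 << 64) - 1

def rotl64(x: int, r: int) -> int:
    # 64-bit rotate left: unlike a plain shift, no bits are lost,
    # they just move position.
    x &= MASK64
    return ((x << r) | (x >> (64 - r))) & MASK64

def mix_round(h: int, word: int) -> int:
    # The multiply pushes entropy toward the top bits; the rotation
    # then moves those well-mixed top bits back down, so the next
    # round's multiply can spread them across the word again.
    h = (h ^ word) * 0x9E3779B97F4A7C15
    return rotl64(h, 29)
```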
I'm perplexed by the claim that addition is cheaper than XOR, especially since addition is built upon XOR. Am I missing something? Is it JavaScript specific?
def hash(s):
    return len(s)

O(1), baby!

In something like C with its generic strings[1], it would surely have to be O(n), since you have to scan the entire string to calculate its length?
(I have always been terrible at big-O, mind.)
[0] There's probably more of them by now.
[1] ie. not a specific length-stored string type.
In any case, for some applications this is indeed a great hash function. Programs like rmlint use it as part of their checks for duplicate files: if files have different lengths, they can't have the same content after all.
Actually, in Python, None is a valid key...
(I'm so sorry. JavaScript has ruined me.)
I get why technically it is a hash function, but still, no.
I guess I've never actually had this problem, because I was always hashing things that were static, or special cases like password hashes where the salt obviously guarantees uniqueness.
Whenever you map into a smaller space, you get collisions. The bigger space doesn't have to be infinite.
def hash(data):
    return data