For example, when you know that your input is uniformly randomly distributed, then truncation is a perfectly good hash function. (And a special case of truncation is the identity function.)
The above condition might sound too strong to be practical, but when you are eg dealing with random (version 4) UUIDs, it is satisfied.
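A minimal sketch of the idea: since each byte of a random UUID is already uniform, "hashing" into a table is just reading the first few bits of the key. The function name `trunc_hash` and the bucket-count choice are my own for illustration, not from any particular library.

```python
import uuid

def trunc_hash(key: bytes, nbits: int = 16) -> int:
    # Truncation as a hash: read the first nbits bits of the key and
    # use them directly. This is only sound because uuid4 output is
    # uniformly random -- on structured keys it would cluster badly.
    nbytes = (nbits + 7) // 8
    return int.from_bytes(key[:nbytes], "big") >> ((-nbits) % 8)

# Distribute 100k random UUIDs over 2^16 buckets.
buckets = [0] * (1 << 16)
for _ in range(100_000):
    buckets[trunc_hash(uuid.uuid4().bytes)] += 1
# Average load is 100000/65536, i.e. about 1.5 keys per bucket.
```

With uniform input this behaves like any "good" hash would, at the cost of a slice and a shift.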
Another interesting hash function: length. See https://news.ycombinator.com/item?id=6919216 for a bad example. For a good example: consider rmlint and other file system deduplicators.
These deduplicators scan your filesystem for duplicate files (amongst other things). You don't want to compare every file against every other file, so as a first optimisation you compare files only by some hash. But conventional hashes like SHA-256 or CRC32 take O(n) in the file size to compute. So you compute cheaper hashes first, even if they are weaker. Truncation, ie only looking at the first few bytes, is very cheap. Determining the length of a file is even cheaper: a single stat call, no reads at all.
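The cascade above can be sketched in a few lines. This is not rmlint's actual implementation, just a toy version of the same idea: group by length first, then by a hash of the first 4 KiB, and only do the full O(n) read for files that still collide.

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(paths):
    """Toy staged deduplicator: size, then head hash, then full hash.

    Each stage only runs on files that still collide after the
    previous, cheaper stage.
    """
    # Stage 1: file length. Just a stat, no reads.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    # Stage 2: hash of the first 4 KiB (truncation). One small read.
    by_head = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue  # unique size, provably not a duplicate
        for p in group:
            with open(p, "rb") as f:
                head = f.read(4096)
            by_head[(size, hashlib.sha256(head).digest())].append(p)

    # Stage 3: full-content hash. O(n), last resort.
    dupes = defaultdict(list)
    for group in by_head.values():
        if len(group) < 2:
            continue
        for p in group:
            h = hashlib.sha256()
            with open(p, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            dupes[h.digest()].append(p)

    return [g for g in dupes.values() if len(g) > 1]
```

On a typical filesystem most candidates drop out at the size stage, so the expensive full read only happens for the rare files that agree on both length and prefix.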
> Those use a rather expensive hash function (you really want to avoid hash collisions), [...]
Then we are clearly not thinking of the same kind of software.
> but (at least some ten years ago) memory, not processing speed, was the limiting factor.
In what I described, IO is the limiting factor. You want to avoid having to read the whole file, if you can.
I think you are thinking of block level online deduplicators that are integrated into the file system?
Ah, right, thanks. I now dimly recall some old project realizing fs-snapshots using hard links, which one could consider some sort of deduplication as well.
> I think you are thinking of block level online deduplicators that are integrated into the file system?
Indeed, I was.
Most modern CoW filesystems also let you mark two files as duplicates so they share extents on disk, without sharing subsequent mutations between them. rmlint supports that, too.
Btw, I'm working on adding deduplication to Bcachefs, and because it's extent-based and not block-based, the logic will look a lot more like rmlint's than what you described.