https://github.com/sahib/rmlint is the one I had in mind.

> Those use a rather expensive hash function (you really want to avoid hash collisions), [...]

Then we are clearly not thinking of the same kind of software.

> but (at least some ten years ago) memory, not processing speed, was the limiting factor.

In what I described, I/O is the limiting factor: you want to avoid reading the whole file if you can.

I think you are thinking of block-level online deduplicators that are integrated into the file system?

> https://github.com/sahib/rmlint is the one I had in mind.

Ah, right, thanks. I now dimly recall some old project implementing fs snapshots with hard links, which one could consider a sort of deduplication as well.

> I think you are thinking of block level online deduplicators that are integrated into the file system?

Indeed, I was.

> Ah, right, thanks. I now dimly recall some old project realizing fs-snapshots using hard links, which one could consider some sort of deduplication as well.

Most modern CoW filesystems also let you mark two files as duplicates (sharing the same on-disk extents) without sharing subsequent mutations between them. Rmlint supports that, too.

Btw, I'm working on adding deduplication to bcachefs, and because it's extent-based rather than block-based, the logic will look a lot more like rmlint's than what you described.
