A pure library implementation that relies on normal function call semantics obviously has to be conservative and save at least all callee-saved registers, but that's not the only possible implementation. An implementation with compiler help should be able to do significantly better, because the compiler knows which registers are actually live at the switch point.
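Ideally the compiler would provide a built-in, but even an implementation using GCC inline asm with proper clobbers can already do significantly better: once the registers are declared as clobbered, the compiler spills only the values it still needs after the switch.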
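
To make the inline asm variant concrete, here is a minimal sketch, assuming x86-64 with the System V ABI and GCC or Clang. The names `struct ctx`, `ctx_switch`, `co_entry` and `run_demo` are made up for illustration and do not come from any library. The asm body saves only the stack pointer, the frame pointer and a resume address; every other general-purpose and SSE register is listed as a clobber, so the compiler decides what is worth spilling. x87/MMX state, AVX-512 registers and the floating-point control words are not handled here.

```c
#include <assert.h>
#include <stdint.h>

#if !defined(__x86_64__) || !defined(__GNUC__)
#error "this sketch assumes x86-64 with GCC or Clang"
#endif

/* Hypothetical context record: just enough to resume execution. */
struct ctx { void *sp; void *ip; void *bp; };

static inline __attribute__((always_inline))
void ctx_switch(struct ctx *from, struct ctx *to)
{
    __asm__ volatile(
        "movq %%rsp, 0(%0)\n\t"        /* save our stack pointer          */
        "movq %%rbp, 16(%0)\n\t"       /* rbp cannot be a clobber when it */
                                       /* is the frame pointer: save it   */
        "leaq 1f(%%rip), %%rax\n\t"    /* address where we resume later   */
        "movq %%rax, 8(%0)\n\t"
        "movq 0(%1), %%rsp\n\t"        /* load the target context         */
        "movq 16(%1), %%rbp\n\t"
        "jmpq *8(%1)\n"
        "1:\n\t"                       /* a later switch back lands here  */
        : "+D"(from), "+S"(to)         /* rdi/rsi hold garbage on resume  */
        :
        : "rax", "rbx", "rcx", "rdx",
          "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15",
          "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7",
          "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14",
          "xmm15", "memory", "cc");
}

static struct ctx main_ctx, co_ctx;
static int trace[5];
static int n;

/* Coroutine body: runs on its own stack and never returns normally. */
static void co_entry(void)
{
    trace[n++] = 1;
    ctx_switch(&co_ctx, &main_ctx);
    trace[n++] = 3;
    ctx_switch(&co_ctx, &main_ctx);
    __builtin_trap();                  /* not resumed a third time */
}

/* Ping-pongs between the two contexts; returns 1 if the order was 0..4. */
int run_demo(void)
{
    static unsigned char stack[64 * 1024] __attribute__((aligned(16)));
    uintptr_t top = (uintptr_t)(stack + sizeof stack);
    assert((top & 15) == 0);

    n = 0;
    co_ctx.sp = (void *)(top - 8);     /* ABI stack alignment at function entry */
    co_ctx.ip = (void *)(uintptr_t)co_entry;
    co_ctx.bp = 0;

    trace[n++] = 0;
    ctx_switch(&main_ctx, &co_ctx);
    trace[n++] = 2;
    ctx_switch(&main_ctx, &co_ctx);
    trace[n++] = 4;

    for (int i = 0; i < 5; i++)
        if (trace[i] != i)
            return 0;
    return n == 5;
}
```

The asm never pushes anything onto the current stack, so it does not touch the red zone of the function it is inlined into. Whether this really beats a hand-written routine that saves every callee-saved register depends on how many values are live at each call site, so it is worth comparing the generated code for the switch points that matter.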