They’re not pointless; they’re just not the first thing to optimize.
It’s like worrying about cache locality when you have an inherently O(n^2) algorithm and could have an O(n log n) or O(n) one. Fix the biggest problem first.
Once your data layout is good and your CPU isn’t taking a 200-cycle lunch break to chase pointers, then you worry about cycle count and keeping the execution units fed.
That’s when integer tricks can matter. Depending on the microarchitecture, you may have twice as many execution units that can take integer instructions. And those instructions (outside of division) tend to have lower latency and higher throughput.
And if you’re doing SIMD, your integer SIMD instructions can be 2 or 4x higher throughput than float32 if you can use int16 / int8 data.
So it can very much matter. It’s just usually not the lowest hanging fruit.
Your float instructions can also be 2x the throughput if you use f16. With no need to go for specific divisors.
For values that can even pack into 8 bits, you rarely have a way to process enough at once to actually get more throughput than with wider numbers.
I'm sure there's a program where it very much matters, but my bet is on it not even mildly mattering, and there basically always being a hundred more useful optimizations to work on.
For integers the situation is better, but even there it hugely depends on your compiler and how much it cheats. You can't replace trig with intrinsics in the general case (it sets errno, for example), and inlining is at best an adequate heuristic that completely fails to take into account what the hot path is, unless you use PGO and keep it up to date.
I've managed to improve a game's worst-case performance by around 50% just by shrinking a method's code size from 3000 bytes to 1500. Keep in mind, I barely even touched the hot path there. It was mostly down to icache usage.
The takeaway from this shouldn't be that "computers are fast and compilers are clever, no point optimising" but more that "you can afford not to optimise in many cases, computers are fast."
Fortunately, D compilers gdc and ldc take advantage of the gcc and llvm optimizers to stay even with everyone else.
My point wasn't "don't optimize" it was "don't optimize the wrong thing".
Trying to replace a division with a bit shift is an example of worrying about the wrong thing, especially since that's a simple optimization the compiler can pick up on.
But as you said, it can be very worth it to optimize around things like the icache. Shrinking and aligning a hot loop can ensure your code isn't spending a bunch of time loading instructions. Cache behavior, in general, is probably the most important thing you can optimize. It's also the thing that can often make it hard to know if you actually optimized something. Changing the size of code can change cache behavior, which might give you the mistaken impression that the code change was what made things faster when in reality it was simply an effect of the code shifting.
A good example of this is using std::vector<bool> vs. std::vector<uint8_t> in the debug build vs release build.
vector<bool> is much slower to access (it's a dynamic bitset). If you have a hot part of the code that frequently touches a vector<bool>, you'll see a multiple-X slowdown in the debug build.
However, in the release build, there is no performance difference between the two (for me at least, I'm making a fairly complicated game). The cache misses bury it.
I've used both in my pathing code and tested each in debug/release.
Even if the std:: implementation was as fast as possible, you're still adding bit manipulation on top of accessing the element, so it will be slower no matter what you do.