Your float instructions can also get 2x the throughput if you use f16, with no need to go for specific divisors.
Even for values that can be packed into 8 bits, you rarely have a way to process enough of them at once to actually get more throughput than with wider numbers.
I'm sure there's a program where it very much matters, but my bet is on it not even mildly mattering, and there basically always being a hundred more useful optimizations to work on.
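
To put rough numbers on the "2x with f16" point, here's a minimal sketch of the lane arithmetic. It assumes a fixed 128-bit vector register (my assumption, picked only for illustration) and that peak throughput scales with the number of elements per register. `lanes_per_register` is a made-up helper, not any library's API.

```python
# Hypothetical helper: how many elements of a given width fit in one register.
def lanes_per_register(register_bits: int, element_bits: int) -> int:
    return register_bits // element_bits

REGISTER_BITS = 128  # assumed register width, for illustration only

f32_lanes = lanes_per_register(REGISTER_BITS, 32)

for name, bits in [("f32", 32), ("f16", 16), ("i8", 8)]:
    lanes = lanes_per_register(REGISTER_BITS, bits)
    # The ratio is a theoretical ceiling relative to f32, not a measured speedup.
    print(f"{name}: {lanes} lanes per register, up to {lanes // f32_lanes}x f32")
```

The 4x figure for 8-bit values is only the ceiling. Whether you get anywhere near it depends on the hardware having the right 8-bit operations and on your data actually arriving in big enough batches, which is the part that rarely works out.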