Which happens right after the Porter-Duff Over operator above -- a smoking gun. Which one is it gonna be?
I.e. the display transform is omitted from this and the math involved with the latter makes your whole argument moot.
It can't be expressed well enough with bitshifts to keep your purported 10x speedup anyway (and which I strongly doubt btw).
And lastly: in a software renderer that stuff is usually <0.01% of the compute in the absolut worst case.
P.S.: I'm speaking from 30 years of experience with software rendering in the context of VFX.
Also, you should use SIMD.