When you have a 12-inch ruler, you effectively have 13 numbers on the ruler. The fact that zero isn't marked is neither here nor there -- the numeral 1 does not sit at the very edge of the ruler; the unmarked edge is zero.
So if you extend the ruler to the largest value you can hold in eight bits, it will range from 0 to 255, and the total length will be 255, even though there are 256 marks.
The ruler analogy may seem overly simplistic, but then the real world is likewise fairly simplistic.
At the end of the day, the numbers presumably come from a sensor or go to a display. In either case, zero often represents as dark as you can get and 255 as light as you can get, so the physics dictates that the intervals associated with 0 and 255 are half the size of the other intervals.
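A minimal sketch of the ruler argument in code, assuming round-to-nearest quantization of a value in [0.0, 1.0]. The helper names `unorm8_to_float` and `float_to_unorm8` are made up for illustration. The divisor is 255 because 256 codes span 255 intervals, and rounding is what leaves codes 0 and 255 with half-width bins.

```c
#include <stdint.h>

/* Hypothetical helper: map an 8-bit code onto [0.0, 1.0].
   256 codes mark 255 intervals, so the divisor is 255, not 256. */
static float unorm8_to_float(uint8_t code) {
    return (float)code / 255.0f;
}

/* Hypothetical helper: round-to-nearest quantization of [0.0, 1.0].
   Code 0 catches only [0, 0.5/255) and code 255 only [254.5/255, 1],
   while every other code catches a full 1/255-wide interval. */
static uint8_t float_to_unorm8(float value) {
    if (value < 0.0f) value = 0.0f;   /* clamp below */
    if (value > 1.0f) value = 1.0f;   /* clamp above */
    return (uint8_t)(value * 255.0f + 0.5f);
}
```

With these definitions, `float_to_unorm8(0.4f / 255.0f)` lands on code 0, while `0.6f / 255.0f` and `1.4f / 255.0f` both land on code 1.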
Audio is more interesting than video, because in audio, you care deeply about not having an offset, and about having a balanced signal, so the question of whether the midpoint is actually on a number or not is pertinent.
In audio, it is often useful to simply discard a code so that 0 is the midpoint (e.g., for 16-bit samples, -32767 to +32767, discarding 0x8000). But this still gives you smaller intervals at both ends.
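As a rough illustration of the discard-one-code idea, here is a sketch that assumes 16-bit signed samples and a float input in [-1.0, 1.0]. The function name `float_to_s16_symmetric` is hypothetical. Scaling by 32767 means the code -32768 is never produced, so zero sits exactly on a code and positive and negative inputs map symmetrically.

```c
#include <stdint.h>

/* Hypothetical helper: quantize [-1.0, 1.0] to -32767..+32767.
   The code -32768 (0x8000) is deliberately never emitted, so the
   range is balanced around 0. The end codes +/-32767 still cover
   only half-width intervals, as noted above. */
static int16_t float_to_s16_symmetric(float sample) {
    if (sample > 1.0f) sample = 1.0f;     /* clamp above */
    if (sample < -1.0f) sample = -1.0f;   /* clamp below */
    float scaled = sample * 32767.0f;
    /* Round half away from zero so +x and -x get codes of equal magnitude. */
    return (int16_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}
```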
https://uops.info/table.html?search=mulss&cb_lat=on&cb_tp=on...
https://uops.info/table.html?search=shr&cb_lat=on&cb_tp=on&c...
In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.
3x faster
In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.
50% faster
For real usage, today's CPUs are limited by memory bandwidth.
// color4_t result = {
// .r = (src.r * src.a + dst.r * inv_alpha) * INV_255,
// .g = (src.g * src.a + dst.g * inv_alpha) * INV_255,
// .b = (src.b * src.a + dst.b * inv_alpha) * INV_255,
// .a = src.a + (dst.a * inv_alpha) * INV_255
// };
// 1/256 but much faster
color4_t result = {
.r = (src.r * src.a + dst.r * inv_alpha) >> 8,
.g = (src.g * src.a + dst.g * inv_alpha) >> 8,
.b = (src.b * src.a + dst.b * inv_alpha) >> 8,
.a = src.a + ((dst.a * inv_alpha) >> 8)
};
Which happens right after the Porter-Duff Over operator above -- a smoking gun. Which one is it gonna be?
I.e. the display transform is omitted from this and the math involved with the latter makes your whole argument moot.
It can't be expressed well enough with bitshifts to keep your purported 10x speedup anyway (and which I strongly doubt btw).
And lastly: in a software renderer that stuff is usually <0.01% of the compute in the absolute worst case.
P.S.: I'm speaking from 30 years of experience with software rendering in the context of VFX.
Also, you should use SIMD.
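On the `>> 8` versus `* INV_255` question above: a plain shift divides by 256, so fully opaque white over white comes out as 254 rather than 255. A commonly used integer alternative gives the correctly rounded x/255 with only adds and shifts. The sketch below is not taken from the quoted code; the function names are made up, and the shift-and-add form is only assumed valid for products of two 8-bit values (0 to 255*255).

```c
#include <stdint.h>

/* What the quoted snippet does: divide by 256 instead of 255. */
static uint8_t div255_shift(uint32_t x) {
    return (uint8_t)(x >> 8);
}

/* Rounded division by 255 using adds and shifts only.
   Assumed input range: 0 <= x <= 255 * 255. */
static uint8_t div255_round(uint32_t x) {
    uint32_t t = x + 128u;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Reference: round-to-nearest x / 255 with a real integer division. */
static uint8_t div255_reference(uint32_t x) {
    return (uint8_t)((x + 127u) / 255u);
}
```

For the largest product, `div255_shift(255u * 255u)` yields 254 while `div255_round(255u * 255u)` yields 255. Whether the difference in cost matters at all is exactly what is being argued in this thread.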
The reason is that year 0 never existed. The year 1 BCE was followed by the year 1 CE.
Culturally, anthropologically, and psychologically it might be a different matter. But 2000 years had not passed before the end of that year.
The calendar was back-dated 500 or so years after Jesus, by a European guy before Europe had the concept of zero, leaving us with 1-indexed years. Then, 200 or so years after that, another guy (still lacking the concept of zero) made the even less venerable decision that the year right before 1 AD would be 1 BC.
We could just decide today that 0 came right before 1 AD and was the first year of the first century AD. Then we’d just have to shift all BC dates by 1 year in all our history books.
The upside would be that arithmetic on year labels starts working again. The downside is that there are way too many history books and no one will ever do this.
Of course, the easier way out is to just decide today that either 1) the first century began in 1 BC or 2) the first century had 1 fewer year than all the other centuries.
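The shift-all-BC-dates-by-one idea is what astronomical year numbering already does: 1 BC becomes year 0, 2 BC becomes year -1, and so on. A small sketch of why arithmetic on year labels then works; the type and function names are hypothetical.

```c
/* Era tag for a traditional year label such as "5 BC" or "5 AD". */
typedef enum { ERA_BC, ERA_AD } era_t;

/* Hypothetical helper: convert a traditional label to a zero-based
   year number, where 1 BC -> 0, 2 BC -> -1, and 1 AD -> 1. */
static int to_zero_based_year(int label, era_t era) {
    return era == ERA_AD ? label : 1 - label;
}

/* Years elapsed between the same calendar date in two labeled years.
   Plain subtraction is correct once both labels are zero-based. */
static int years_between(int from_label, era_t from_era,
                         int to_label, era_t to_era) {
    return to_zero_based_year(to_label, to_era)
         - to_zero_based_year(from_label, from_era);
}
```

For example, `years_between(1, ERA_BC, 1, ERA_AD)` is 1, whereas naively treating 1 BC as -1 and subtracting would give 2.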