I'm confused by that analogy. Is the “ruler” a 255-inch ruler with 256 points labeled 0–255, or is it a 256-inch ruler with 256 1-inch segments, making L = 256×1?
reply
The analogy is pretty straightforward.

When you have a 12 inch ruler, you effectively have 13 numbers on the ruler. The fact that zero isn't marked is neither here nor there -- the numeral one is not at the far end of the ruler.

So if you extend the ruler to be as long as you can hold in eight bits, it will range from 0 to 255, and the total length will be 255.

The ruler analogy may seem overly simplistic, but then the real world is likewise fairly simplistic.

At the end of the day, the numbers presumably come from a sensor or go to a display. In either case, zero often represents as dark as you can get and 255 as light as you can get, so the physics dictate that the intervals associated with 0 and 255 are half the size of the other intervals.
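
To make the half-size end intervals concrete, here is a minimal sketch, assuming the usual round-to-nearest mapping from an intensity in [0, 1] to an 8-bit code. The name `quantize_255` is hypothetical, not something from the thread.

```c
#include <stdint.h>

/* Map an intensity in [0, 1] to the nearest 8-bit code, treating the codes
 * as 256 points on a ruler of length 255 (code 0 -> 0.0, code 255 -> 1.0).
 * Assumption: plain round-to-nearest; real sensors and displays may differ. */
static uint8_t quantize_255(float x)
{
    if (x < 0.0f) x = 0.0f;   /* clamp out-of-range input */
    if (x > 1.0f) x = 1.0f;
    return (uint8_t)(x * 255.0f + 0.5f);   /* truncation after +0.5 rounds to nearest */
}
```

With this mapping, code 1 collects everything from 0.5/255 to 1.5/255, a full step wide, while code 0 only collects 0 to 0.5/255 and code 255 only collects 254.5/255 to 1.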

Audio is more interesting than video, because in audio, you care deeply about not having an offset, and about having a balanced signal, so the question of whether the midpoint is actually on a number or not is pertinent.

In audio, it is often useful to simply discard a code so that 0 is the midpoint (e.g. for 16-bit samples, -32767 to +32767, discarding 0x8000). But this still gives you smaller intervals at both ends.
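
A minimal sketch of the discarded-code idea, assuming signed 16-bit PCM, where the symmetric range is -32767 to +32767 and the dropped code is -32768 (bit pattern 0x8000). The function name `pcm16_to_float` is hypothetical.

```c
#include <stdint.h>

/* Convert a signed 16-bit sample to a float in [-1, 1] with 0 exactly at
 * the midpoint. The one extra negative code is folded onto its neighbour
 * so the positive and negative ranges are the same size. */
static float pcm16_to_float(int16_t s)
{
    if (s == INT16_MIN)
        s = -32767;               /* discard the unbalanced code */
    return (float)s / 32767.0f;
}
```

Dividing by 32768 instead would keep every code but leave the positive side one step short of +1.0, which is the kind of imbalance the comment is about.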

reply
Fencepost errors aren't errors if you are actually trying to count fenceposts.
reply
yes but >> 8 is so much faster
reply
You don’t divide a float by 256 by shifting it right eight bits; that would yield complete garbage. You subtract 8 from the exponent, then check if you got an underflow.
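
A minimal sketch of that exponent trick, assuming IEEE 754 binary32 floats. The function name `div256_exponent` is hypothetical, and the fallback branch is my own choice for handling the cases where the shortcut would be wrong.

```c
#include <stdint.h>
#include <string.h>

/* Divide a float by 256 by subtracting 8 from its biased exponent field.
 * Falls back to an ordinary multiply when the shortcut does not apply:
 * zero and denormal inputs, Inf/NaN, or a result that would underflow
 * out of the normal range. */
static float div256_exponent(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);            /* well-defined type pun */
    uint32_t exponent = (bits >> 23) & 0xFFu;  /* 8-bit biased exponent */
    if (exponent <= 8u || exponent == 0xFFu)
        return x * (1.0f / 256.0f);            /* 1/256 is exact, so this is safe */
    bits -= 8u << 23;                          /* exponent -= 8 */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

The check on the exponent is the underflow test mentioned above; without it, small inputs would wrap into garbage.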
reply
Same point; divide by power of 2 is a fast subtraction operation in float world, while divide by 255 shits all over the whole float
reply
If your input is an arbitrary float, you need to check for denormals (and maybe NaNs). You can do a bitmasking trick to avoid conditional jumps, but I'm skeptical you can do it faster than a SIMD multiply instruction.
reply
It's just multiplication. Floating multiply is extraordinarily fast.
reply
The difference between 20 cycles and 1 clock cycle in a hot loop is very noticeable
reply
It's 3 cycles for float multiplication (and 1 for shift right):

https://uops.info/table.html?search=mulss&cb_lat=on&cb_tp=on...

https://uops.info/table.html?search=shr&cb_lat=on&cb_tp=on&c...

In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.

reply
Shift right isn't even relevant here - if you shift before conversion to float, all your values end up 0, and if you want to divide afterwards it's no longer a simple shift.
reply
Exactly. Although if you do >> 8 while working with uint8, it will be the fastest :)
reply
> It's 3 cycles for float multiplication (and 1 for shift right):

3x faster

> In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.

50% faster

reply
FP Division by constant is optimized by a compiler into a multiply. Graphics processing typically happens on the GPU these days, and on all recent GPUs FPMUL belongs to the class of lowest-latency operations. That is, there are no other instructions that complete faster.
reply
Only with things like -ffast-math enabled will compilers do the reciprocal. It can make a fair difference in some cases, but it's often better to selectively use it in code locations you know are acceptable by doing it manually in the code.
reply
That's only valid to do if the reciprocal is representable exactly.
reply
That's not totally true. It's sufficient to be exactly representable, but you only need the reciprocal rounding error to be small enough to guarantee the multiplication rounding step fixes it across the entire range of numerators. For IEEE754 f16 values, there are 28 such extra values, the positive and negative sides of 1705/x where x is a power of 2 at least as great as 2048.
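
One way to see the exact-reciprocal point for yourself is a small brute-force check over 8-bit inputs. This is only a sketch, assuming IEEE 754 binary32 arithmetic without -ffast-math; the function name `count_mismatches` is hypothetical.

```c
/* Count the 8-bit inputs (0..255) for which multiplying by a rounded
 * reciprocal gives a different float than a true division by `divisor`. */
static int count_mismatches(float divisor)
{
    float recip = 1.0f / divisor;   /* rounded once, up front */
    int mismatches = 0;
    for (int i = 0; i <= 255; ++i) {
        float x = (float)i;
        if (x / divisor != x * recip)
            ++mismatches;
    }
    return mismatches;
}
```

For a power-of-two divisor such as 256 the reciprocal is exact, so the count is 0. For 255 the reciprocal is rounded, so the two forms are not guaranteed to agree for every input, and the count tells you how often they differ on your platform.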
reply
Useful, then, that you can start several vectorized floating-point muls each cycle. (E.g., most modern x86 are 3/0.5 cycles for vmulps. No 20 cycles in sight.)
reply
Only in micro-benchmarks.

For real usage, today's CPUs are limited by memory bandwidth.

reply
What are you talking about? In a hot loop in my software renderer this is like 10x faster:

    // color4_t result = {
    //     .r = (src.r * src.a + dst.r * inv_alpha) * INV_255,
    //     .g = (src.g * src.a + dst.g * inv_alpha) * INV_255,
    //     .b = (src.b * src.a + dst.b * inv_alpha) * INV_255,
    //     .a = src.a + (dst.a * inv_alpha) * INV_255
    // };

    // 1/256 but much faster
    color4_t result = {
        .r = (src.r * src.a + dst.r * inv_alpha) >> 8,
        .g = (src.g * src.a + dst.g * inv_alpha) >> 8,
        .b = (src.b * src.a + dst.b * inv_alpha) >> 8,
        .a = src.a + ((dst.a * inv_alpha) >> 8)
    };
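
For comparison, dividing by 255 does not need a real division either. Below is a minimal sketch of a well-known add-and-shift formulation, assuming 8-bit channels and an 8-bit source alpha; the names `div255_round` and `blend_channel` are hypothetical and not taken from the renderer above.

```c
#include <stdint.h>

/* Rounded x / 255 using only adds and shifts.
 * Valid for 0 <= x <= 255 * 255, which covers any product of two bytes. */
static uint8_t div255_round(uint32_t x)
{
    uint32_t t = x + 128u;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Blend one 8-bit channel: src over dst, weighted by an 8-bit alpha.
 * The accumulator never exceeds 255 * 255, so div255_round applies. */
static uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t alpha)
{
    uint32_t acc = (uint32_t)src * alpha + (uint32_t)dst * (255u - alpha);
    return div255_round(acc);
}
```

The practical difference shows up at full intensity: `(255 * 255) >> 8` is 254, so a plain `>> 8` can never produce 255 from a fully opaque white pixel, while the version above returns 255.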
reply
And both are wrong, since the values would have to be in a linear color space for the compositing math to make sense, but in some non-linear space (e.g. non-linear sRGB) to be useful when mapped to 0..255.

Which happens right after the Porter-Duff Over operator above -- a smoking gun. Which one is it gonna be?

I.e. the display transform is omitted from this and the math involved with the latter makes your whole argument moot.

It can't be expressed well enough with bitshifts to keep your purported 10x speedup anyway (and which I strongly doubt btw).

And lastly: in a software renderer that stuff is usually <0.01% of the compute in the absolute worst case.

P.S.: I'm speaking from 30 years of experience with software rendering in the context of VFX.

reply
If the latter is 10x faster, the issue is some kind of weird compilation failure for the above version. For one, it only cuts a third of the multiplies.
reply
How is this supposed to be 10x faster if all you did was drop one out of three multiplications?
reply
Because you are working in the cache.

Also, you should use SIMD.

reply
> Also, you should use SIMD.

Ironically, no: clang is better at auto-vectorizing.
reply
Better than what? And do you use `-mavx2` or do you let it target baseline x86_64 and miss out on 8-float vectors? How do you make sure its autovectorisation is successful?
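
For what it's worth, one way to check is to ask the compiler to report which loops it vectorised. A minimal sketch, assuming clang or GCC on x86-64; the function name `scale_by_inv255` and the file name in the comment are hypothetical.

```c
#include <stddef.h>

/* Scale n floats by 1/255. `restrict` promises the compiler that the two
 * arrays do not overlap, which removes one common obstacle to
 * auto-vectorisation.
 *
 * To see whether the loop was vectorised, one option is:
 *   clang -O3 -mavx2 -Rpass=loop-vectorize -c scale.c
 *   gcc   -O3 -mavx2 -fopt-info-vec        -c scale.c
 * Without -mavx2 (or a suitable -march), the compiler targets baseline
 * x86-64 and is limited to 4-float SSE vectors. */
static void scale_by_inv255(float *restrict out, const float *restrict in, size_t n)
{
    const float inv255 = 1.0f / 255.0f;
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * inv255;
}
```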
reply
But who says that the numbers are representing the points, rather than representing the intervals between the points?
reply
It doesn't even need to represent intervals. A 13 inch ruler with 13 markings at 0.5, 1.5, etc inches is still a valid ruler, albeit an odd construction.
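
The two readings lead to two different conversions from an 8-bit code to a value in [0, 1]. A minimal sketch, with hypothetical function names:

```c
#include <stdint.h>

/* Codes as points: 256 marks on a ruler of length 255.
 * Code 0 is exactly 0.0 and code 255 is exactly 1.0. */
static float code_as_point(uint8_t k)
{
    return (float)k / 255.0f;
}

/* Codes as interval centres: 256 equal segments of length 1/256,
 * each marked at its midpoint (the 0.5, 1.5, ... ruler). */
static float code_as_centre(uint8_t k)
{
    return ((float)k + 0.5f) / 256.0f;
}
```

In the second convention every code covers an equally sized interval, but no code lands exactly on 0.0 or 1.0.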
reply
I’m dumb. Doesn’t 0 start at the beginning?
reply
It's right up there with the confusion if 2000 was the new year of the 21st century or the last year of the 19th century.
reply
For the record, the mathematically correct answer to this question is that the year 2000 was the last year of the 19th century.

The reason is that year 0 never existed. The year 1 BCE was followed by the year 1 CE.

Culturally, anthropologically, and psychologically it might be a different matter. But 2000 years had not passed before the end of that year.

reply
What makes this argument less compelling is that “year 1 AD” also didn’t exist at the time, and this isn’t a great reason to abandon the arithmetically sane approach of zero-indexed year numbering.

The calendar was back-dated 500 or so years after Jesus, by a European guy before Europe had the concept of zero, leaving us with 1-indexed years. Then, 200 or so years after that, another guy (still lacking the concept of zero) made the even less venerable decision that the year right before 1 AD would be 1 BC.

We could just decide today that 0 came right before 1 AD and was the first year of the first century AD. Then we’d just have to shift all BC dates by 1 year in all our history books.

The upside would be that arithmetic on year labels starts working again. The downside is that there are way too many history books and no one will ever do this.

Of course, the easier way out is to just decide today that either 1) the first century began in 1 BC or 2) the first century had 1 fewer year than all the other centuries.

reply
We could also just define that 0 AD = 1 BC and not have to rewrite any BC dates.
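
A minimal sketch of how the arithmetic works out under that definition, assuming years are given as a positive label plus a BC flag. The function name `signed_year` is hypothetical.

```c
/* Convert a historical year label to a signed year in which 1 BC is 0,
 * 2 BC is -1, and so on. AD years are unchanged. */
static int signed_year(int year, int is_bc)
{
    return is_bc ? 1 - year : year;
}
```

With this mapping, plain subtraction gives elapsed years across the boundary: from 5 BC to 5 AD is `signed_year(5, 0) - signed_year(5, 1)`, which is 9, not the 10 that naive label arithmetic suggests.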
reply
The debate is if 2000 is the first year of the 21st century or the last year of the 20th century. (btw I agree with the latter)
reply
wow, yeah, that's quite the miss on my part.
reply
the correct way is to use a slide rule
reply