
I'm confused by that analogy. Is the “ruler” a 255-inch ruler with 256 points labeled 0–255, or is it a 256-inch ruler with 256 1-inch segments, making L = 256×1?

by zephen12 hours ago|


The analogy is pretty straightforward.

When you have a 12 inch ruler, you effectively have 13 numbers on the ruler. The fact that zero isn't marked is neither here nor there -- the numeral one is not at the far end of the ruler.

So if you extend the ruler to be as long as you can hold in eight bits, it will range from 0 to 255, and the total length will be 255.

The ruler analogy may seem overly simplistic, but then the real world is likewise fairly simplistic.

At the end of the day, the numbers presumably come from a sensor, or go to a display, and, often, in either case, zero represents as dark as you can get and 255 represents as light as you can get, so the physics dictate that the intervals associated with the 0 and 255 are half the size of the rest of the intervals.
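
One way to read the "points on a ruler" convention in code is sketched below. The helper names `code_to_unit` and `unit_to_code` are made up for illustration, and the sketch assumes 8-bit codes mapped onto [0, 1]. Code 0 lands exactly on 0.0 and code 255 exactly on 1.0, and with round-to-nearest the two end codes each capture an interval half as wide as the others.

```c
#include <stdint.h>

/* Hypothetical helpers: treat the 256 codes as points at 0/255 .. 255/255. */
static float code_to_unit(uint8_t code) {
    return (float)code / 255.0f;
}

/* Round to the nearest point; inputs outside [0, 1] are clamped.
   Code 0 captures [0, 0.5/255) and code 255 captures [254.5/255, 1],
   while every other code captures a full 1/255-wide interval. */
static uint8_t unit_to_code(float x) {
    if (x <= 0.0f) return 0;
    if (x >= 1.0f) return 255;
    return (uint8_t)(x * 255.0f + 0.5f);
}
```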

Audio is more interesting than video, because in audio, you care deeply about not having an offset, and about having a balanced signal, so the question of whether the midpoint is actually on a number or not is pertinent.

In audio, it is often useful to simply discard a code so that 0 is the midpoint (e.g. -32767 to +32767 for 16-bit samples, discarding 0x8000). But this still gives you smaller intervals at both ends.
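
A minimal sketch of the "discard one code" idea, assuming 16-bit two's-complement samples (an assumption of this example). The helper name `sample_to_unit` is hypothetical. The most negative code is folded onto its neighbour so that the range is symmetric around zero.

```c
#include <stdint.h>

/* Hypothetical helper: map a signed 16-bit sample onto [-1, 1] with 0 -> 0.0.
   The most negative code (-32768) is folded onto -32767, so it is
   effectively discarded and the usable range is symmetric. */
static float sample_to_unit(int16_t s) {
    if (s == INT16_MIN) s = -INT16_MAX;
    return (float)s / 32767.0f;
}
```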

by knappa4 hours ago|


Fencepost errors aren't errors if you are actually trying to count fenceposts.

by lacedeconstruct22 hours ago|


yes but >> 8 is so much faster

by xigoi21 hours ago|


You don’t divide a float by 256 by shifting it right eight bits; that would yield complete garbage. You subtract 8 from the exponent, then check if you got an underflow.
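
The exponent adjustment described above can be sketched as follows. This assumes `float` is an IEEE 754 binary32 value, and `div256_by_exponent` is a hypothetical name. The fallback to ordinary division covers the cases where subtracting from the exponent field would give a wrong answer.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: divide a float by 256 by subtracting 8 from the
   biased exponent field. Falls back to ordinary division whenever the
   shortcut would be wrong: zero/subnormal input (exponent field 0),
   Inf/NaN (exponent field 255), or a result that would underflow. */
static float div256_by_exponent(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);           /* well-defined type pun */
    uint32_t exponent = (bits >> 23) & 0xFFu;
    if (exponent <= 8u || exponent == 0xFFu)
        return x / 256.0f;
    bits -= 8u << 23;                         /* exponent -= 8 */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```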

by dheera21 hours ago|


Same point; divide by power of 2 is a fast subtraction operation in float world, while divide by 255 shits all over the whole float

by yongjik18 hours ago|


If your input is an arbitrary float, you need to check for denormals (and maybe NaNs). You can do a bitmasking trick to avoid conditional jumps, but I'm skeptical you can do it faster than a SIMD multiply instruction.

by StilesCrisis22 hours ago|


It's just multiplication. Floating multiply is extraordinarily fast.

by lacedeconstruct21 hours ago|


The difference between 20 cycles and 1 clock cycle in a hot loop is very noticeable

by exyi21 hours ago|


It's 3 cycles for float multiplication (and 1 for shift right):

https://uops.info/table.html?search=mulss&cb_lat=on&cb_tp=on...

https://uops.info/table.html?search=shr&cb_lat=on&cb_tp=on&c...

In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.

by account427 hours ago|


Shift right isn't even relevant here: if you shift before conversion to float, all your values end up 0, and if you want to divide afterwards, it's no longer a simple shift.
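
A small sketch of the two orderings, with hypothetical helper names. Shifting an 8-bit code before conversion discards every bit, whereas converting first keeps the value and leaves a floating-point scale to apply.

```c
#include <stdint.h>

/* Shifting an 8-bit code right by 8 discards every bit: always 0. */
static uint8_t shift_first(uint8_t code) {
    return (uint8_t)(code >> 8);
}

/* Converting first keeps the information; the scale is then a multiply. */
static float convert_first(uint8_t code) {
    return (float)code * (1.0f / 256.0f);
}
```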

by exyi3 hours ago|


Exactly. Although if you do >> 8 while working with uint8, it will be the fastest :)

by userbinator14 hours ago|


> It's 3 cycles for float multiplication (and 1 for shift right):

3x faster

> In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.

50% faster

by Tuna-Fish21 hours ago|


FP Division by constant is optimized by a compiler into a multiply. Graphics processing typically happens on the GPU these days, and on all recent GPUs FPMUL belongs to the class of lowest-latency operations. That is, there are no other instructions that complete faster.

by pixelesque20 hours ago|


Only with things like -ffast-math enabled will compilers do the reciprocal. It can make a fair difference in some cases, but it's often better to selectively use it in code locations you know are acceptable by doing it manually in the code.

by mgaunard21 hours ago|


That's only valid to do if the reciprocal is representable exactly.
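
To see why the rewrite is not always safe, the sketch below counts how many 8-bit codes give a different float when divided by a constant versus multiplied by its precomputed reciprocal. The function name and the `divisor` parameter are assumptions of this example; `volatile` is used only to force each intermediate to be rounded to `float`.

```c
/* Count 8-bit codes for which true division and multiplication by a
   precomputed reciprocal give different floats. `divisor` is a parameter
   of this sketch, not something from the thread. */
static int count_reciprocal_mismatches(float divisor) {
    volatile float reciprocal = 1.0f / divisor;  /* rounded once, up front */
    int mismatches = 0;
    for (int code = 0; code < 256; code++) {
        volatile float by_division = (float)code / divisor;
        volatile float by_multiply = (float)code * reciprocal;
        if (by_division != by_multiply)
            mismatches++;
    }
    return mismatches;
}
```

For a power-of-two divisor such as 256.0f the count is zero, because the reciprocal is exactly representable. For 255.0f the reciprocal is rounded, so the two results may disagree in the last bit for some codes; printing the count on your own target is the easiest way to check.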

by hansvm19 hours ago|


That's not totally true. It's sufficient to be exactly representable, but you only need the reciprocal rounding error to be small enough to guarantee the multiplication rounding step fixes it across the entire range of numerators. For IEEE754 f16 values, there are 28 such extra values, the positive and negative sides of 1705/x where x is a power of 2 at least as great as 2048.

by Sesse__21 hours ago|


Useful, then, that you can start several vectorized floating-point muls each cycle. (E.g., most modern x86 are 3/0.5 cycles for vmulps. No 20 cycles in sight.)

by dist-epoch22 hours ago|


Only in micro-benchmarks.

For real usage, today's CPUs are limited by memory bandwidth.

by lacedeconstruct22 hours ago|


What are you talking about? In a hot loop in my software renderer, this is like 10x faster:

    // color4_t result = {
    //     .r = (src.r * src.a + dst.r * inv_alpha) * INV_255,
    //     .g = (src.g * src.a + dst.g * inv_alpha) * INV_255,
    //     .b = (src.b * src.a + dst.b * inv_alpha) * INV_255,
    //     .a = src.a + (dst.a * inv_alpha) * INV_255
    // };

    // 1/256 but much faster
    color4_t result = {
        .r = (src.r * src.a + dst.r * inv_alpha) >> 8,
        .g = (src.g * src.a + dst.g * inv_alpha) >> 8,
        .b = (src.b * src.a + dst.b * inv_alpha) >> 8,
        .a = src.a + ((dst.a * inv_alpha) >> 8)
    };
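
For comparison, here is a sketch of a well-known integer trick that divides by 255 with correct rounding using only adds and shifts. It is not taken from the code above: the names `div255_round` and `blend_channel` are hypothetical, and the sketch assumes 8-bit integer channels and an 8-bit alpha. A plain `>> 8` maps 255 * 255 to 254, so fully opaque white comes out slightly dark, while this version maps it to 255.

```c
#include <stdint.h>

/* Exact round(x / 255) for 0 <= x <= 255 * 255, using only adds and shifts. */
static uint8_t div255_round(uint32_t x) {
    uint32_t t = x + 128u;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Hypothetical blend of one 8-bit channel: src over dst with 8-bit alpha.
   The weighted sum never exceeds 255 * 255, so div255_round is exact. */
static uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t alpha) {
    uint32_t sum = (uint32_t)src * alpha + (uint32_t)dst * (255u - alpha);
    return div255_round(sum);
}
```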

by virtualritz9 hours ago|


And both are wrong, since the values would have to be in a linear color space for the compositing math to make sense, but in some non-linear space (e.g. non-linear sRGB) to be useful when mapped to 0..255.

Which happens right after the Porter-Duff Over operator above -- a smoking gun. Which one is it gonna be?

I.e. the display transform is omitted from this and the math involved with the latter makes your whole argument moot.

It can't be expressed well enough with bitshifts to keep your purported 10x speedup anyway (and which I strongly doubt btw).

And lastly: in a software renderer that stuff is usually <0.01% of the compute in the absolute worst case.

P.S.: I'm speaking from 30 years of experience with software rendering in the context of VFX.

by Tuna-Fish21 hours ago|


If the latter is 10x faster, the issue is some kind of weird compilation failure for the above version. For one, it only cuts a third of the multiplies.

by imtringued8 hours ago|


How is this supposed to be 10x faster if all you did was drop one out of three multiplications?

by dist-epoch22 hours ago|


Because you are working in the cache.

Also, you should use SIMD.

by lacedeconstruct22 hours ago|


> Also, you should use SIMD.

Ironically, no: clang is better at auto-vectorizing.

by spider-mario8 hours ago|


Better than what? And do you use `-mavx2` or do you let it target baseline x86_64 and miss out on 8-float vectors? How do you make sure its autovectorisation is successful?
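
One way to check is to write the loop in a form the compiler can easily vectorize and ask clang to report what it did. The kernel below is a hypothetical example (the name `scale_by_inv255` and its signature are assumptions), not code from the renderer in question.

```c
#include <stddef.h>

/* Hypothetical kernel: scale n floats by 1/255. `restrict` promises the
   compiler that the arrays do not overlap, which helps it vectorize. */
static void scale_by_inv255(float *restrict out, const float *restrict in,
                            size_t n) {
    const float inv255 = 1.0f / 255.0f;
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * inv255;
}
```

Compiling with `clang -O2 -mavx2 -Rpass=loop-vectorize` prints a remark for each loop that was vectorized, and `-Rpass-missed=loop-vectorize` reports loops the vectorizer gave up on. Inspecting the generated assembly for 256-bit `ymm` registers is another way to confirm that AVX2 is actually being used.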

by szundi22 hours ago|


[dead]

by layer819 hours ago|


But who says that the numbers are representing the points, rather than representing the intervals between the points?

by wky19 hours ago|


It doesn't even need to represent intervals. A 13 inch ruler with 13 markings at 0.5, 1.5, etc inches is still a valid ruler, albeit an odd construction.
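
The "markings at 0.5, 1.5, ..." ruler corresponds to treating each code as the centre of an equal-width bin. A minimal sketch, assuming 8-bit codes over [0, 1) and using hypothetical helper names:

```c
#include <stdint.h>

/* Hypothetical helpers: treat each code as the centre of one of 256
   equal-width bins covering [0, 1). */
static float code_to_center(uint8_t code) {
    return ((float)code + 0.5f) / 256.0f;
}

/* Every bin has width 1/256; inputs outside [0, 1) are clamped. */
static uint8_t unit_to_bin(float x) {
    if (x <= 0.0f) return 0;
    if (x >= 1.0f) return 255;
    return (uint8_t)(x * 256.0f);
}
```

Under this convention all 256 intervals have the same width, but neither 0.0 nor 1.0 is exactly representable by any code.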

by groundzeros201522 hours ago|


I’m dumb. Doesn’t 0 start at the beginning?

by dylan60419 hours ago|


It's right up there with the confusion if 2000 was the new year of the 21st century or the last year of the 19th century.

by simonask19 hours ago|


For the record, the mathematically correct answer to this question is that the year 2000 was the last year of the 20th century.

The reason is that year 0 never existed. The year 1 BCE was followed by the year 1 CE.

Culturally, anthropologically, and psychologically it might be a different matter. But 2000 years had not passed before the end of that year.
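
The arithmetic for a calendar with no year 0 can be sketched as a one-line function. The name `century_of` is hypothetical, and the sketch only handles years from 1 CE onward.

```c
/* Century number for a year >= 1 in a calendar with no year 0:
   years 1..100 -> 1, 101..200 -> 2, and so on. */
static int century_of(int year_ce) {
    return (year_ce - 1) / 100 + 1;
}
```

Here `century_of(2000)` is 20 and `century_of(2001)` is 21, which is the off-by-one that a zero-indexed calendar would avoid.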

by tshaddox11 hours ago|


What makes this argument less compelling is that “year 1 AD” also didn’t exist at the time, and this isn’t a great reason to abandon the arithmetically sane approach of zero-indexed year numbering.

The calendar was back-dated 500 or so years after Jesus, by a European guy before Europe had the concept of zero, leaving us with 1-indexed years. Then, 200 or so years after that, another guy (still lacking the concept of zero) made the even less venerable decision that the year right before 1 AD would be 1 BC.

We could just decide today that 0 came right before 1 AD and was the first year of the first century AD. Then we’d just have to shift all BC dates by 1 year in all our history books.

The upside would be that arithmetic on year labels starts working again. The downside is that there are way too many history books and no one will ever do this.

Of course, the easier way out is to just decide today that either 1) the first century began in 1 BC or 2) the first century had 1 fewer year than all the other centuries.

by account427 hours ago|


We could also just define 0 AD = 1 BC and not have to rewrite any BC dates.
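
This is essentially how astronomical year numbering works: BC labels stay as written and are mapped to signed integers for arithmetic. A minimal sketch with hypothetical helper names:

```c
/* Hypothetical helper: relabel "n BC" as year 1 - n, so 1 BC -> 0,
   2 BC -> -1, and AD years keep their numbers. */
static int bc_to_signed_year(int n_bc) {
    return 1 - n_bc;
}

/* With signed labels, elapsed years are a plain subtraction. */
static int years_between(int from_year, int to_year) {
    return to_year - from_year;
}
```

For example, from 5 BC to 5 AD is `years_between(bc_to_signed_year(5), 5)`, which gives 9 rather than the 10 that naive label arithmetic would suggest.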

by tzot18 hours ago|


The debate is if 2000 is the first year of the 21st century or the last year of the 20th century. (btw I agree with the latter)

by dylan60417 hours ago|


wow, yeah, that's quite the miss on my part.

by m46317 hours ago|


the correct way is to use a slide rule