What are you talking about? In a hot loop in my software renderer, this is like 10x faster:

    // color4_t result = {
    //     .r = (src.r * src.a + dst.r * inv_alpha) * INV_255,
    //     .g = (src.g * src.a + dst.g * inv_alpha) * INV_255,
    //     .b = (src.b * src.a + dst.b * inv_alpha) * INV_255,
    //     .a = src.a + (dst.a * inv_alpha) * INV_255
    // };

    // >> 8 scales by 1/256 instead of 1/255: slightly off, but much faster
    color4_t result = {
        .r = (src.r * src.a + dst.r * inv_alpha) >> 8,
        .g = (src.g * src.a + dst.g * inv_alpha) >> 8,
        .b = (src.b * src.a + dst.b * inv_alpha) >> 8,
        .a = src.a + ((dst.a * inv_alpha) >> 8)
    };
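For reference, `>> 8` divides by 256 rather than 255, so opaque white over white comes out as 254 instead of 255. Below is a minimal sketch of a rounding divide by 255 that is exact for these inputs and still uses only adds and shifts. It assumes 8-bit channels and `inv_alpha = 255 - alpha`, neither of which is stated in the comment above, and the function names are made up for illustration.

```c
#include <stdint.h>

/* Exact round(x / 255) for 0 <= x <= 255 * 255, using only adds and shifts. */
static inline uint8_t div255_round(uint32_t x) {
    x += 128;
    return (uint8_t)((x + (x >> 8)) >> 8);
}

/* The shortcut from the snippet above: floor(x / 256). */
static inline uint8_t div256_floor(uint32_t x) {
    return (uint8_t)(x >> 8);
}

/* One 8-bit channel of src over dst, assuming inv_alpha = 255 - alpha. */
static inline uint8_t blend_exact(uint8_t src, uint8_t dst, uint8_t alpha) {
    return div255_round((uint32_t)src * alpha + (uint32_t)dst * (255u - alpha));
}

/* Same blend, using the >> 8 shortcut. */
static inline uint8_t blend_shift(uint8_t src, uint8_t dst, uint8_t alpha) {
    return div256_floor((uint32_t)src * alpha + (uint32_t)dst * (255u - alpha));
}
```

With these, `blend_exact(255, 255, 255)` gives 255 while `blend_shift(255, 255, 255)` gives 254. Whether that one-step error matters depends on how many times a pixel gets blended.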

by virtualritz 9 hours ago

And both are wrong, since the values would have to be in a linear color space for the compositing math to make sense, but in some non-linear space (e.g. non-linear sRGB) to be useful when mapped to 0..255.

The mapping to 0..255 happens right after the Porter-Duff Over operator above, which is a smoking gun. Which one is it gonna be?

I.e. the display transform is omitted from this, and the math involved in the latter makes your whole argument moot.

That transform can't be expressed well enough with bit shifts to keep your purported 10x speedup anyway (a speedup I strongly doubt, btw).

And lastly: in a software renderer that stuff is usually <0.01% of the compute in the absolute worst case.

P.S.: I'm speaking from 30 years of experience with software rendering in the context of VFX.

by Tuna-Fish 21 hours ago

If the latter is 10x faster, the issue is some kind of weird compilation failure for the above version. For one, it only cuts a third of the multiplies.

by imtringued 9 hours ago

How is this supposed to be 10x faster if all you did was drop one out of three multiplications?

by dist-epoch 22 hours ago

Because you are working in the cache.

Also, you should use SIMD.

by lacedeconstruct 22 hours ago

> Also, you should use SIMD.

Ironically, no: clang is better at auto-vectorizing.

by spider-mario 9 hours ago

Better than what? And do you use `-mavx2` or do you let it target baseline x86_64 and miss out on 8-float vectors? How do you make sure its autovectorisation is successful?
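One way to answer the last question is to ask the compiler for vectorization remarks rather than assume. The sketch below is a hypothetical span blend (the name `blend_span` and the single shared 8-bit alpha are assumptions, not something from the thread), written as a plain loop over `restrict` pointers so the vectorizer has a fair chance. The flags in the comment are existing clang and GCC options; whether the loop is actually vectorized depends on the compiler version and target.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Check what the compiler did instead of guessing, e.g.:
 *   clang -O3 -mavx2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c blend.c
 *   gcc   -O3 -mavx2 -fopt-info-vec-optimized -fopt-info-vec-missed -c blend.c
 * Dropping -mavx2 targets baseline x86-64 (SSE2) and narrower vectors.
 */

/* Blend n 8-bit channel values of src over dst in place with one shared alpha. */
void blend_span(uint8_t *restrict dst, const uint8_t *restrict src,
                uint8_t alpha, size_t n) {
    const uint32_t inv_alpha = 255u - alpha;
    for (size_t i = 0; i < n; ++i) {
        /* Rounded divide by 255 using adds and shifts only. */
        uint32_t t = src[i] * alpha + dst[i] * inv_alpha + 128u;
        dst[i] = (uint8_t)((t + (t >> 8)) >> 8);
    }
}
```

The remarks name each loop and say whether it was vectorized (and, for misses, usually why not), which is more reliable than inferring it from timings.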

by szundi 22 hours ago

[dead]