They’re not pointless; they’re just not the first thing to optimize.
It’s like worrying about cache locality when you have an inherently O(n^2) algorithm and could have an O(n log n) or O(n) one. Fix the biggest problem first.
Once your data layout is good and your CPU isn’t taking a 200-cycle lunch break to chase pointers, then you worry about cycle count and keeping the execution units fed.
That’s when integer tricks can matter. Depending on the microarchitecture, you may have twice as many execution units that can take integer instructions. And those instructions (outside of division) tend to have lower latency and higher throughput.
And if you’re doing SIMD, your integer SIMD instructions can be 2 or 4x higher throughput than float32 if you can use int16 / int8 data.
So it can very much matter. It’s just usually not the lowest hanging fruit.
Your float instructions can also be 2x the throughput if you use f16. With no need to go for specific divisors.
For values that can even pack into 8 bits, you rarely have a way to process enough at once to actually get more throughput than with wider numbers.
I'm sure there's a program where it very much matters, but my bet is on it not even mildly mattering, and there basically always being a hundred more useful optimizations to work on.
For integers the situation is better, but even there it hugely depends on your compiler and how much it cheats. You can't replace trig with intrinsics in the general case (it sets errno, for example), and inlining is at best an adequate heuristic that completely fails to take into account what the hot path is, unless you use PGO and keep it up to date.
I've managed to improve a game's worst-case performance by around 50% just by shrinking a method's code size from 3000 bytes to 1500. Keep in mind, I barely even touched the hot path there. It was mostly down to icache usage.
The takeaway from this shouldn't be that "computers are fast and compilers are clever, no point optimising" but more that "you can afford not to optimise in many cases, computers are fast."
Fortunately, D compilers gdc and ldc take advantage of the gcc and llvm optimizers to stay even with everyone else.
My point wasn't "don't optimize" it was "don't optimize the wrong thing".
Trying to replace a division with a bit shift is an example of worrying about the wrong thing, especially since that's a simple optimization the compiler can pick up on.
But as you said, it can be very worth it to optimize around things like the icache. Shrinking and aligning a hot loop can ensure your code isn't spending a bunch of time loading instructions. Cache behavior, in general, is probably the most important thing you can optimize. It's also the thing that can often make it hard to know if you actually optimized something. Changing the size of code can change cache behavior, which might give you the mistaken impression that the code change was what made things faster when in reality it was simply an effect of the code shifting.
A good example of this is using std::vector<bool> vs. std::vector<uint8_t> in the debug build vs release build.
vector<bool> is much slower to access (it's a dynamic bitset). If you have a hot part of the code that frequently touches a vector<bool>, you'll see a multiple-X slowdown in the debug build.
However, in the release build, there is no performance difference between the two (for me at least, I'm making a fairly complicated game). The cache misses bury it.
I've used both in my pathing code and tested each in debug/release.
Even if the std:: implementation was as fast as possible, you're still adding bit manipulation on top of accessing the element, so it will be slower no matter what you do.