It's worth noting (and the paper does go into this) that this is limited to a very specific subset of UB, which they call "guardable."

They are not removing UB around things like out-of-bounds accesses or use-after-free, which would likely be more expensive to guard against.
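For a concrete flavor of the kind of UB-driven optimization at stake, signed-overflow assumptions are a classic example; whether the paper files this particular case under "guardable" is my assumption, not a claim from the paper:

    // A sketch: because signed overflow is UB, the compiler may assume
    // `i` never wraps, so the trip count is exactly n + 1 for n >= 0 and
    // the loop can be folded to a closed form or vectorized. With
    // wrapping semantics (-fwrapv) it must also handle n == INT_MAX,
    // where `i <= n` never becomes false.
    int count_upto(int n) {
        int count = 0;
        for (int i = 0; i <= n; ++i)
            ++count;
        return count;
    }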

reply
I don’t understand the downvotes. Empirical research on the performance impact of undefined behavior is badly needed: the C++ committee’s obsession with strict undefined behavior (in contrast with longstanding semantics, e.g., treating uninitialized memory accesses as just fine) has been justified largely by how it enables optimizing compilers. This research shows that many types of UB have a negligible impact on performance.
reply
Possibly somebody downvoted because "thank you" in all caps is not a substantial contribution to the discussion. It feels like the kind of low-effort stuff you'd see on Reddit.

Also, commenting on downvotes is generally frowned upon.

reply
You're getting downvoted because you're looking for a particular result ("UB optimizations don't help performance") rather than actually evaluating the quality of this analysis (which doesn't really support what you want anyway).
reply
> by using link-time optimizations

These are almost never used by software projects.

reply
The only places where I've seen LTO not be used are places with bad, unreliable build systems that systematically introduce undefined behaviour by violating the ODR.
reply
The only organization I've worked in that had comprehensive LTO for C++ code was Google. I've worked at other orgs even with 1000s of engineers where LTO, PGO, BOLT, and other things you might consider standard techniques were considered voodoo and too much trouble to bother with, despite the obvious efficiency improvements being left on the table.
reply
I helped with PGO work at Microsoft over 15 years ago, back when it was a Microsoft Research project.

The issue with early PGO implementations was getting a really good profile: you had to have automation capable of fully exercising the code paths that you knew would be hot in actual usage, and you needed good instrumentation to know which code paths those were!

The same problem exists nowadays, but programs are instrumented to hell and back to collect usage data.
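For reference, the modern instrumentation-based PGO loop with Clang looks roughly like this (a sketch; app.cpp and the training run are placeholders for a real build and a representative workload):

    # 1. Build with instrumentation
    clang++ -O2 -fprofile-instr-generate app.cpp -o app

    # 2. Run a representative workload to produce a raw profile
    LLVM_PROFILE_FILE=app.profraw ./app

    # 3. Merge raw profiles into the indexed format the compiler reads
    llvm-profdata merge -output=app.profdata app.profraw

    # 4. Rebuild with the profile applied
    clang++ -O2 -fprofile-instr-use=app.profdata app.cpp -o app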

reply
I am willing to assume that organizations dedicated to shipping software to customers, like Microsoft or Autodesk, are almost certainly all in on these optimization techniques. The organizations where I worked are ones operating first-party or third-party software in the cloud, where they're responsible for building their own artifacts.
reply
PGO is pretty difficult. In my experience compilers don't seem to know the difference between "this thing never runs" and "we don't have any information about whether this thing runs". Similarly, it might be more useful to know "is this branch predictable" than just "what % of the time is it taken".

CPUs are so dynamic anyway that there often isn't a way to pass down the information you'd get from the profile. E.g., I don't think Intel actually recommends any way of hinting branch directions.
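At the source level, the closest thing is a compiler-side hint such as GCC/Clang's __builtin_expect (or C++20's [[likely]]/[[unlikely]]), which only influences code layout, not the hardware predictor. A minimal sketch, with hypothetical helper functions:

    // Hypothetical helpers, just for illustration.
    void handle_error();
    void do_work(int);

    void process(int* p) {
        // The hint only affects layout: the error path is moved off the
        // fall-through, so the common case runs straight-line code.
        if (__builtin_expect(p == nullptr, 0)) {
            handle_error();
        } else {
            do_work(*p);
        }
    }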

reply
It's implied by the target offset: branches that are likely to be taken jump backwards, unlikely branches jump forwards.
reply
Not generally, no. This is true for some chips, especially (very) old or simple cores, but it's not something to lean on for modern high end cores.
reply
Generally, yes. This is not just for "simple" cores; this is the state-of-the-art static branch prediction algorithm as described by Intel in their optimization manual.

"Branches that do not have a history in the BTB ... are predicted using a static prediction algorithm: Predict forward conditional branches to be NOT taken. Predict backward conditional branches to be taken."

It then goes on to recommend exactly what every optimizing compiler and post-link optimizers like BOLT do:

"Arrange code to be consistent with the static branch prediction algorithm: make the fall-through code following a conditional branch be the likely target for a branch with a forward target, and make the fall-through code following a conditional branch be the unlikely target for a branch with a backward target."

This is why a reduction in taken forward branches is one of the key statistics that BOLT reports.
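In source terms, the layout advice amounts to something like this (a sketch; the comments describe the branch shape a compiler following the manual's advice typically emits, which can vary):

    // The common path falls through the rare-case check (a forward
    // branch, statically predicted not-taken), while the loop's
    // back-edge is a backward branch (statically predicted taken).
    long sum_nonnegative(const int* a, long n) {
        long s = 0;
        for (long i = 0; i < n; ++i) {  // back-edge: backward, taken
            if (a[i] < 0)               // rare: forward branch, not taken
                continue;
            s += a[i];
        }
        return s;
    }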

reply
Surely you are not putting code behind an if/else?
reply
Google doesn't use full LTO either, since the binaries are way too big. ThinLTO is vastly less powerful.
reply
"Vastly" eh? I seem to recall that LLVM ThinLTO has slight regressions compared to GCC LTO on specCPU but on Google's own applications the superior whole-program devirtualization offered only with ThinLTO is a net win.
reply
I'll adjust my phrasing.

As a user, building with ThinLTO vs full LTO generally produces pretty similar performance, in no small part because a huge amount of effort has gone into making the summaries as effective as possible for key performance needs.

As a compiler developer, especially when developing static analysis warnings rather than optimization passes, the number of cases where I've run into "this would be viable if we had full LTO" has been pretty high.

reply
In practice, the default ABI on Linux x86-64 still limits you to binaries of 4 GB or thereabouts.

Not exactly a problem for LTO, since any reasonable build machine will have 128 GB of RAM.

reply
Yeah, I would have liked to see the paper specify whether the LTO they tried is fat LTO or ThinLTO.
reply
Facebook uses LTO/PGO for C++ pretty broadly.
reply
Yeah, they just never hired me. They also invented BOLT.

I think there is a valley in terms of organization size where you have tons of engineers but not enough to accomplish peak optimization of C++ projects. These are the orgs that are spending millions to operate, for example, the VERY unoptimized PostgreSQL packages from Ubuntu, in AWS.

reply
Well, Ubuntu isn't really a good project to look up to :)

Hell, their latest upgrade broke one of their flavours. Not to mention how fragile their installer is.

reply
Violating the ODR doesn't introduce UB; it's IFNDR, ill-formed no diagnostic required, which is much worse in principle and, in such cases, probably also in practice.

UB is a runtime phenomenon: it happens or it doesn't, and we may be able to ensure the case where it happens doesn't occur through ordinary human controls.

But IFNDR is a property of the compiled program: if you have IFNDR (by some estimates that's most C++ programs), your program has no defined behaviour and never did, so there is no possible countermeasure. Too bad, game over.
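A minimal sketch of how such an ODR violation looks (hypothetical names; e.g., two translation units picking up different versions of the same header):

    // a.cpp -- compiled against one version of a header
    inline int buffer_size() { return 4096; }
    int from_a() { return buffer_size(); }

    // b.cpp -- same symbol, different definition (e.g., a stale copy)
    inline int buffer_size() { return 65536; }
    int from_b() { return buffer_size(); }

    // Linking both TUs into one program is IFNDR: the linker keeps a
    // single definition of buffer_size(), so from_a() and from_b() can
    // silently disagree with their own source, and no diagnostic is
    // required.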

reply
I am curious where you have seen LTO used. Linux distributions and open source projects in general rarely use LTO. Their build systems are usually very good.
reply
LTO is heavily used in my experience. If it breaks something that is indicative of other issues that need to be addressed.
reply
The main issue isn't that it breaks stuff but that compiling with it tends to be pretty slow.
reply
... That's why you compile without LTO during development and do a final 'compile with LTO > profile > fix / optimize > compile with LTO' pass.

Compilation happens once, and then the code runs on hundreds of thousands up to billions of devices. Respect your users.
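With Clang the split looks roughly like this (a sketch; the file names are placeholders, and the link step assumes a linker with LTO support, such as lld):

    # Day-to-day development: plain object files, fast iteration
    clang++ -O2 -c module.cpp main.cpp
    clang++ module.o main.o -o app

    # Release pass: emit bitcode and optimize across TUs at link time
    clang++ -O2 -flto=thin -c module.cpp main.cpp
    clang++ -flto=thin module.o main.o -o app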

reply
This assumes that LTO is strictly better than no LTO, i.e., it only gets faster, has the same optimization hotspots, and doesn't break anything.

I would recommend only doing things that fit within the 'build > test > fix' loop.

reply
Which doesn't matter at all in a release build. And in a dev build it's rarely necessary.
reply
At FAANG scale the cost is prohibitive. Hence the investment in ThinLTO.
reply
At FAANG scale, you absolutely want to have a pass before deployment that does this or you're leaving money on the table.
reply
It's not as obvious a win as you may think. Keep in mind that every binary that gets deployed and executed will be compiled many more times, before and after, for testing. For some binaries, this number could easily reach hundreds of thousands of times. Why? In a monorepo, a lot of changes come in every day, and testing those changes involves traversing a reachability graph of potentially affected code and running its tests.
reply
How many Linux distributions use LTO? It is a rarity among Gentoo users as far as I know, and that is the one place where you would expect more LTO usage.
reply
It's on by default for Rust release builds, so at least the codepaths in LLVM for it are well-exercised.
reply
I don't think that's right unless the docs are stale:

    [profile.release]
    lto = false
https://doc.rust-lang.org/cargo/reference/profiles.html#rele...
reply
So the thing is that false still means a form of thin LTO is used, depending on other settings; see https://doc.rust-lang.org/cargo/reference/profiles.html#lto

> false: Performs “thin local LTO” which performs “thin” LTO on the local crate only across its codegen units.

I think this is kind of confusing but whatever. I should have been clearer.

reply
There is no cross-crate LTO with 'lto = false', but there is cross-crate thin LTO with 'lto = "thin"'. The codepaths might still be getting hit, but individual CGUs within a crate are generally invisible to the user, which can create the impression that LTO doesn't occur. (That is, if you operate under the mental model of the crate being the basic compilation unit, then 'lto = false' means you'll never see LTO.)
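For reference, the values the Cargo reference documents for this knob (a sketch of a release profile; pick one):

    [profile.release]
    # lto = false    # the default: "thin local LTO" within each crate
    # lto = "thin"   # cross-crate ThinLTO
    # lto = true     # equivalent to "fat": full cross-crate LTO
    # lto = "off"    # disables LTO entirely, even the thin local kind
    lto = "thin"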
reply
Oh I hadn’t realized Rust does that. Really cool.
reply
That must have been changed sometime in the last year, then. When I enable LTO for one of my projects with a Rust compiler from 2024, the compilation time more than doubles.
reply
I should have been clearer: thin LTO is on by default, not full "fat" LTO, for exactly that reason.
reply