upvote
The only organization I've worked in that had comprehensive LTO for C++ code was Google. I've worked at other orgs even with 1000s of engineers where LTO, PGO, BOLT, and other things you might consider standard techniques were considered voodoo and too much trouble to bother with, despite the obvious efficiency improvements being left on the table.
reply
I helped with PGO work at Microsoft over 15 years ago, back when it was a Microsoft Research project.

The issue with early PGO implementations was getting a really good profile: you had to have automation capable of fully exercising the code paths you knew would be hot in actual usage, and you needed good instrumentation to know what code paths those were!

The same problem exists nowadays, but programs are instrumented to hell and back to collect usage data.
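For anyone who hasn't seen it, the two-phase workflow that profile feeds is roughly this. A sketch using GCC's flags with hypothetical file names (Clang's `-fprofile-instr-generate`/`-fprofile-instr-use` pair is analogous):

```shell
# Hypothetical PGO build; app.cpp and the workload file are stand-ins.
g++ -O2 -fprofile-generate app.cpp -o app   # build with profiling instrumentation
./app < representative-workload.txt         # exercise the hot paths you care about
g++ -O2 -fprofile-use app.cpp -o app        # rebuild, optimizing with the recorded profile
```

The quality of the final binary is only as good as how representative that middle step is, which is exactly the hard part described above.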

reply
I am willing to assume that organizations dedicated to shipping software to customers, like Microsoft or Autodesk, are almost certainly all in on these optimization techniques. The organizations where I worked operate first-party or third-party software in the cloud, where they're responsible for building their own artifacts.
reply
PGO is pretty difficult. In my experience, compilers don't seem to know the difference between "this thing never runs" and "we don't have any information about whether this thing runs". Similarly, it might be more useful to know "is this branch predictable" than just "what % of the time is it taken".

CPUs are so dynamic anyway that there often isn't a way to pass down the information you'd get from the profile, e.g. I don't think Intel actually recommends any way of hinting branch directions.

reply
It's implied by the target offset: likely-taken branches jump backward, unlikely branches jump forward.
reply
Not generally, no. This is true for some chips, especially (very) old or simple cores, but it's not something to lean on for modern high end cores.
reply
Generally yes. This is not just for "simple" cores; this is the state-of-the-art static branch prediction algorithm as described by Intel in their optimization manual.

"Branches that do not have a history in the BTB ... are predicted using a static prediction algorithm: Predict forward conditional branches to be NOT taken. Predict backward conditional branches to be taken."

It then goes on to recommend exactly what every optimizing compiler and post-link optimizers like BOLT do:

"Arrange code to be consistent with the static branch prediction algorithm: make the fall-through code following a conditional branch be the likely target for a branch with a forward target, and make the fall-through code following a conditional branch be the unlikely target for a branch with a backward target."

This is why a reduction in taken forward branches is one of the key statistics that BOLT reports.

reply
Surely you are not putting code behind an if/else
reply
Google doesn't have full LTO either, since the binaries are way too big. ThinLTO is vastly less powerful.
reply
"Vastly" eh? I seem to recall that LLVM ThinLTO has slight regressions compared to GCC LTO on specCPU but on Google's own applications the superior whole-program devirtualization offered only with ThinLTO is a net win.
reply
I'll adjust my phrasing.

As a user, building with ThinLTO vs. full LTO generally produces pretty similar performance, in no small part because a huge amount of effort has gone into making the summaries as effective as possible for key performance needs.

As a compiler developer, especially when developing static analysis warnings rather than optimization passes, the number of cases where I've run into "this would be viable if we had full LTO" has been pretty high.
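For reference, the Clang spelling of the two modes being compared (file names hypothetical; lld shown as the linker since it understands LLVM bitcode):

```shell
# Full (monolithic) LTO: all bitcode is merged into one module at link
# time, giving the optimizer a whole-program view but serializing the work.
clang++ -O2 -flto=full -fuse-ld=lld a.cpp b.cpp -o app

# ThinLTO: each module carries a compact summary, enabling parallel,
# scalable cross-module optimization at the cost of the full-module view.
clang++ -O2 -flto=thin -fuse-ld=lld a.cpp b.cpp -o app
```

The summary-based design is exactly why ThinLTO scales to huge binaries while giving analyses less to work with than the merged-module view.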

reply
In practice, the default small code model on Linux x86-64 is still limiting you to binaries of 2 GB or thereabouts.

Not exactly a problem for LTO, since any reasonable build machine will have 128 GB of RAM.

reply
Yeah, I would have liked to see the paper specify whether the LTO they tried is fat LTO or ThinLTO.
reply
Facebook uses LTO/PGO for C++ pretty broadly.
reply
Yeah they just never hired me. They also invented BOLT.

I think there is a valley in organization size where you have tons of engineers, but not enough to achieve peak optimization of C++ projects. These are the orgs spending millions to operate, for example, the VERY non-optimized PostgreSQL packages from Ubuntu, in AWS.

reply
Well, Ubuntu isn't really a good project to look up to :)

Hell, their latest upgrade broke one of their flavours. Not to mention how fragile their installer is.

reply
Violating the ODR doesn't introduce UB; it's IFNDR, ill-formed no diagnostic required, which is much worse in principle, and in such cases probably also in practice.

UB is a runtime phenomenon: it happens, or it doesn't, and we may be able to ensure the case where it happens doesn't occur with ordinary human controls.

But IFNDR is a property of the compiled program. If you have IFNDR (by some estimates that's most C++ programs), your program has no defined behaviour and never did, so there is no possible countermeasure. Too bad, game over.
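A classic instance, shown as a sketch of two translation units (not one runnable file; the names are made up): the same inline function defined differently in each TU. No diagnostic is required, and under LTO the optimizer may legitimately pick either body for both callers.

```cpp
// a.cpp
inline int answer() { return 42; }   // one definition...
int from_a() { return answer(); }

// b.cpp
inline int answer() { return 24; }   // ...a different definition: ODR violated
int from_b() { return answer(); }

// The program is ill-formed, no diagnostic required: the linker silently
// keeps one copy of answer(), so from_a() and from_b() might both return
// 42, both return 24, or anything else. The program never had defined
// behaviour, regardless of what any particular build happens to do.
```

LTO tends to surface these cases because it's the first point where both definitions are visible to the same tool.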

reply
I am curious where you have seen LTO used. Linux distributions and open source projects in general rarely use LTO. Their build systems are usually very good.
reply