A * x * x * x * x * x * x + B * x * x * x * x + C * x * x + D
(10 muls, 3 muladds)instead of the faster
tmp = x * x;
((A * tmp + B) * tmp + C) * tmp + D
(1 mul, 3 muladds) typedef double v2d __attribute__ ((vector_size (16)));
v2d packed = { x, x };
packed = fma(packed, As, Bs);
packed = fma(packed, Cs, Ds);
// ...
return x * packed[0] + packed[1]
smth like thatActually one project I was thinking of doing was creating SLP vectorized versions of libm functions. Since plenty of programs spend a lot of time in libm calling single inputs, but the implementation is usually a bunch of scalar instructions.
Naive: https://godbolt.org/z/Gzf1KM9Tc
Horner's: https://godbolt.org/z/jhvGqcxj1
_Can_ you even make a reasonable high-ILP scheme for a polynomial, unless it's of extremely high degree?
There are a few good general options to extract more ILP for latency-dominated contexts, though all of them trade additional register pressure and usually some additional operation count; Estrin's scheme is the most commonly used. Factoring medium-order polynomials into quadratics is sometimes a good option (not all such factorizations are well behaved wrt numerical stability, but it also can give you the ability to synthesize selected extra-precise coefficients naturally without doing head-tail arithmetic). Quadratic factorizations are a favorite of mine because (when they work) they yield good performance in _both_ latency- and throughput-dominated contexts, which makes it easier to deliver identical results for scalar and vectorized functions.
There's no general form "best" option for optimizing latency; when I wrote math library functions day-to-day we just built a table of the optimal evaluation sequence for each order of polynomial up to 8 or so and each microarchitecture and grabbed the one we needed unless there were special constraints that required a different choice.
But if you're writing in an interpreted language that doesn't have a good JIT, or for a platform with a custom compiler, it might be worth hand-tweaking expressions with an eye towards performance and precision.
Though... they are allowed to cache common subexpressions, and my point about dependency chains is quite relevant on modern hardware. So x*x, x*x*x, etc may each be computed once. And since arithmetic operators are left-to-right associative, the rather ugly code, as written, is fast and not as wasteful as it appears.
This is incorrect, for exactly the reason you are citing: A * x * x * x * x = (((A * x) * x) * x) * x), which means that (x * x) is nowhere to be seen in the expression and cannot be factored out. Now, if you wrote x * x * x * x * A instead, _then_ the compiler could have done partial CSE against the term with B, although still not as much as you'd like.
In hindsight, they probably didn't have anybody with the right background and should have contracted out the job. I certainly didn't have the necessary knowledge, either.