typedef double v2d __attribute__ ((vector_size (16)));
v2d x2 = { x * x, x * x };          // both lanes step by x^2
v2d packed = As;                     // As = { top odd coeff, top even coeff }
packed = packed * x2 + Bs;           // elementwise; contracts to FMA with -mfma
packed = packed * x2 + Cs;
packed = packed * x2 + Ds;
// ...
return x * packed[0] + packed[1];    // p(x) = x*odd(x^2) + even(x^2)
Something like that.
Actually, one project I was thinking of doing was creating SLP-vectorized versions of libm functions. Plenty of programs spend a lot of time in libm calls on single inputs, but the implementation is usually just a bunch of scalar instructions.
Naive: https://godbolt.org/z/Gzf1KM9Tc
Horner's: https://godbolt.org/z/jhvGqcxj1
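For context, the two versions being compared are presumably along these lines (c0..c3 are placeholder coefficients, not the ones in the links):

static const double c0 = 1.0, c1 = 0.5, c2 = 0.25, c3 = 0.125; // placeholders

// Naive: each power of x spelled out; more multiplies, but the
// power chain is partially independent of the coefficient muls.
double naive(double x)  { return c0 + c1*x + c2*x*x + c3*x*x*x; }

// Horner's: fewest operations, but one serial multiply-add chain.
double horner(double x) { return c0 + x*(c1 + x*(c2 + x*c3)); }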
_Can_ you even make a reasonable high-ILP scheme for a polynomial, unless it's of extremely high degree?
There are a few good general options for extracting more ILP in latency-dominated contexts, though all of them trade additional register pressure and usually some additional operation count; Estrin's scheme is the most commonly used (sketched below). Factoring medium-order polynomials into quadratics is sometimes a good option (not all such factorizations are well behaved wrt numerical stability, but it can also let you synthesize selected extra-precise coefficients naturally, without doing head-tail arithmetic). Quadratic factorizations are a favorite of mine because (when they work) they yield good performance in _both_ latency- and throughput-dominated contexts, which makes it easier to deliver identical results for scalar and vectorized functions.
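A minimal sketch of Estrin's scheme for a degree-7 polynomial, assuming coefficients in a double array c[0..7] (the name, signature, and degree are illustrative):

#include <math.h>

// Estrin's scheme, degree 7: the four inner FMAs are mutually
// independent and can all be in flight at once, versus Horner's
// single serial chain of 7 dependent FMAs.
double estrin7(double x, const double c[8]) {
    double x2  = x * x;
    double x4  = x2 * x2;
    double t01 = fma(c[1], x, c[0]);   // c0 + c1*x
    double t23 = fma(c[3], x, c[2]);   // c2 + c3*x
    double t45 = fma(c[5], x, c[4]);   // c4 + c5*x
    double t67 = fma(c[7], x, c[6]);   // c6 + c7*x
    double lo  = fma(t23, x2, t01);    // (c0 + c1*x) + (c2 + c3*x)*x^2
    double hi  = fma(t67, x2, t45);    // (c4 + c5*x) + (c6 + c7*x)*x^2
    return fma(hi, x4, lo);            // lo + hi*x^4
}

The quadratic-factorization option mentioned above would instead evaluate something like a*(x^2 + b1*x + c1)*(x^2 + b2*x + c2)*..., where the quadratic factors are independent of one another.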
There's no single "best" option for optimizing latency in general; when I wrote math library functions day-to-day, we just built a table of the optimal evaluation sequence for each polynomial order up to 8 or so on each microarchitecture, and grabbed the one we needed unless special constraints required a different choice.
But if you're writing in an interpreted language that doesn't have a good JIT, or for a platform with a custom compiler, it might be worth hand-tweaking expressions with an eye towards performance and precision.
Though... they are allowed to cache common subexpressions, and my point about dependency chains is quite relevant on modern hardware. So x*x, x*x*x, etc. may each be computed only once. And since arithmetic operators are left-to-right associative, the rather ugly code, as written, is fast and not as wasteful as it appears.
This is incorrect, for exactly the reason you are citing: A * x * x * x * x = (((A * x) * x) * x) * x, which means that (x * x) is nowhere to be seen in the expression and cannot be factored out. Now, if you wrote x * x * x * x * A instead, _then_ the compiler could have done partial CSE against the term with B, although still not as much as you'd like.
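To make the parse concrete (A, B, x are placeholder doubles, not from any real library):

// As written: left-to-right association gives the two chains
// no common subexpression at all.
double p1 = A*x*x*x*x + B*x*x*x;   // (((A*x)*x)*x)*x + ((B*x)*x)*x

// Powers first: x*x*x*x*A parses as (((x*x)*x)*x)*A, so the
// ((x*x)*x) prefix is shared with x*x*x*B and can be CSE'd.
double p2 = x*x*x*x*A + x*x*x*B;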