Taylor series in general are a poor choice, and Pade approximants of Taylor series are equally poor. If you're going to use Pade approximants, they should be of the original function.
I prefer Chebyshev approximation: https://www.embeddedrelated.com/showarticle/152.php which is often close enough to the more complicated Remez algorithm.
So this article got me wondering how much accuracy is needed before computing a series beats pre-computed lookup tables and interpolation. Anyone got any relevant experience to share?
How much accuracy does ray tracing require?
You may then get away with simple algebra and square roots. A runtime such as Python would do a lot of that transparently.
> This amazing snippet of code was languishing in the docs of dead software, which in turn the original formula was scrawled away in a math textbook from the 60s.
was kind of telling for me. I have some background in this sort of work (and long ago concluded that there was pretty much nothing you can do to improve on existing code, unless either you have some new specific hardware or domain constraint, or you're just looking for something quick-n-dirty for whatever reason, or are willing to invest research-paper levels of time and effort) and to think that someone would call Abramowitz and Stegun "a math textbook from the 60s" is kind of funny. It's got a similar level of importance to its field as Knuth's Art of Computer Programming or stuff like that. It's not an obscure text. Yeah, you might forget what all is in it if you don't use it often, but you'd go "oh, of course that would be in there, wouldn't it...."
In physics grad school, professors would occasionally allude to it, and textbooks would cite it ... pretty often. So it's a thing anyone with postgraduate physics education should know exists, but you wouldn't ever be assigned it.
Impressive that an LLM managed to produce the answer from a 7 year old stack overflow answer all on its own! [1] This would have been the first search result for “fast asin” before this article was published.
float asin(float x) {
float x2 = 1.0f-fabs(x);
u32 i = bitcast(x2);
i = 0x5f3759df - (i>>1);
float inv = bitcast(i);
return copysign(pi/2-pi/2*(x2*inv),x);
}
Courtesy of evil floating point bithacks.The forbidden magic
FTFY :P
They could be orthogonal improvements, but if I were prioritizing, I'd go for SIMD first.
I searched for asin on Intel's intrinsics guide. They have a AVX-512 instrinsic `_mm512_asin_ps` but it says "sequence" rather than single-instruction. Presumably the actual sequence they use is in some header file somewhere, but I don't know off-hand where to look, so I don't know how it compares to a SIMDified version of `fast_asin_cg`.
https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
I skimmed the author's source code, and this is where I'd start: https://github.com/define-private-public/PSRayTracing/blob/8...
Instead of an `_objects`, I might try for a `_spheres`, `_boxes`, etc. (Or just `_lists` still using the virtual dispatch but for each list, rather than each object.) The `asin` seems to be used just for spheres. Within my `Spheres::closest_hit` (note plural), I'd work to SIMDify it. (I'd try to SIMDify the others too of course but apparently not with `asin`.) I think it's doable: https://github.com/define-private-public/PSRayTracing/blob/8...
I don't know much about ray tracers either (having only written a super-naive one back in college) but this is the general technique used to speed up games, I believe. Besides enabling SIMD, it's more cache-efficient and minimizes dispatch overhead.
edit: there's also stuff that you can hoist in this impl. Restructuring as SoA isn't strictly necessary to do that, but it might make it more obvious and natural. As an example, this `ray_dir.length_squared()` is the same for the whole list. You'd notice that when iterating over the spheres. https://github.com/define-private-public/PSRayTracing/blob/8...
I think it's also more fun sometimes to take existing systems and to try to optimize them given whatever constraints exist. I've had to do that a lot in my day job already.
I was curious what "sequence" would end up being but my compiler is too old for that intrinsic. Even godbolt didn't help for gcc or clang but it did reveal that icc produced a call https://godbolt.org/z/a3EsKK4aY
I like throwing away work I've done. Frees up my mental capacity for other work to throw away.
A fantastic amount of collective human thought has been dedicated to function approximations in the last century; Taylor methods are over 200 years old and unlikely to come close to state-of-the-art.
On a general note, instructions like division and square root are roughly equal to trig functions in cycle count on modern CPUs. So, replacing one with the other will not confer much benefit, as evidenced from the results. They're all typically implemented using LUTs, and it's hard to beat the performance of an optimized LUT, which is basically a multiplexer connected to some dedicated memory cells in hardware.
For a little toy ray tracer, it is pretty measly. But for a larger corporation (with a professional project) a 4% speed improvement can mean MASSIVE cost savings.
Some of these tiny improvements can also have a cascading effect. Imagining finding a +4%, a +2% somewhere else, +3% in neighboring code, and a bunch of +1%s here and there. Eventually you'll have built up something that is 15-20% faster. Down the road you'll come across those optimizations which can yield the big results too (e.g. the +25%).
If you're alluding to gcc vs fbstring's performance (circa 15:43), then the performance improvement is not because fbstring is faster/simpler, but due to a foundational gcc design decision to always use the heap for string variables. Also, at around 16:40, the speaker concedes that gcc's simpler size() implementation runs significantly faster (3x faster at 0.3 ns) when the test conditions are different.
People have gotten PhDs for smaller optimizations. I know. I've worked with them.
> instructions like division and square root are roughly equal to trig functions in cycle count on modern CPUs.
What's the x86-64 opcode for arcsin?
x86-64 had instructions for the exponential and logarithmic functions in Xeon Phi, but those instructions have been removed in Skylake Server and the later Intel or AMD CPUs with AVX-512 support.
However, instructions for trigonometric functions have no longer been added after Intel 80387, and those of 8087 and 80387 are deprecated.
Not required. ATAN and SQRTS(S|D) are sufficient, the half-angle approach in the article is the recommended way.
> People have gotten PhDs for smaller optimizations. I know. I've worked with them.
I understand the can, not sure about the should. Not trying to be snarky, we just seem to be producing PhDs with the slimmest of justifications. The bar needs to be higher.
I couldn't disagree more. Sure, making a 4% faster asin isn't going to change the world, but if it makes all callers a teensy bit faster, multiplied by the number of callers using it, then it adds up. Imagine the savings for a hyperscaler if they managed to made a more common instruction 4% faster.
I've spent the past few months improving the performance of some work thing by ~8% and the fun I've been having reminds me of the nineties, when I tried to squeeze every last % of performance out of the 3D graphics engine that I wrote as a hobby.
I could imagine some graphics workloads tend compute asin() repeatedly with nearby input values. But I'd guess the locality isn't local enough to matter, only eight double precision floats fit in a cache line.
> if you use smaller data types than double or float. Even dropping interpolation worked fine,
That's kinda tautological isn't it? Of course reduced precision is acceptable where reduced precision is acceptable... I guess I'm assuming double precision was used for a good reason, it often isn't :)
About dropping the interpolation: Yes you are right of course. I was thinking about the speed. No noticable speed improvement by dropping interpolation. The asin calls are only a small fraction of everything.
A * x * x * x * x * x * x + B * x * x * x * x + C * x * x + D
(10 muls, 3 muladds)instead of the faster
tmp = x * x;
((A * tmp + B) * tmp + C) * tmp + D
(1 mul, 3 muladds) typedef double v2d __attribute__ ((vector_size (16)));
v2d packed = { x, x };
packed = fma(packed, As, Bs);
packed = fma(packed, Cs, Ds);
// ...
return x * packed[0] + packed[1]
smth like thatActually one project I was thinking of doing was creating SLP vectorized versions of libm functions. Since plenty of programs spend a lot of time in libm calling single inputs, but the implementation is usually a bunch of scalar instructions.
Naive: https://godbolt.org/z/Gzf1KM9Tc
Horner's: https://godbolt.org/z/jhvGqcxj1
_Can_ you even make a reasonable high-ILP scheme for a polynomial, unless it's of extremely high degree?
There are a few good general options to extract more ILP for latency-dominated contexts, though all of them trade additional register pressure and usually some additional operation count; Estrin's scheme is the most commonly used. Factoring medium-order polynomials into quadratics is sometimes a good option (not all such factorizations are well behaved wrt numerical stability, but it also can give you the ability to synthesize selected extra-precise coefficients naturally without doing head-tail arithmetic). Quadratic factorizations are a favorite of mine because (when they work) they yield good performance in _both_ latency- and throughput-dominated contexts, which makes it easier to deliver identical results for scalar and vectorized functions.
There's no general form "best" option for optimizing latency; when I wrote math library functions day-to-day we just built a table of the optimal evaluation sequence for each order of polynomial up to 8 or so and each microarchitecture and grabbed the one we needed unless there were special constraints that required a different choice.
But if you're writing in an interpreted language that doesn't have a good JIT, or for a platform with a custom compiler, it might be worth hand-tweaking expressions with an eye towards performance and precision.
Though... they are allowed to cache common subexpressions, and my point about dependency chains is quite relevant on modern hardware. So x*x, x*x*x, etc may each be computed once. And since arithmetic operators are left-to-right associative, the rather ugly code, as written, is fast and not as wasteful as it appears.
This is incorrect, for exactly the reason you are citing: A * x * x * x * x = (((A * x) * x) * x) * x), which means that (x * x) is nowhere to be seen in the expression and cannot be factored out. Now, if you wrote x * x * x * x * A instead, _then_ the compiler could have done partial CSE against the term with B, although still not as much as you'd like.
In hindsight, they probably didn't have anybody with the right background and should have contracted out the job. I certainly didn't have the necessary knowledge, either.
The approximation reported here is slightly faster but only accurate to about 2.7e11 ulp. That's totally appropriate for the graphics use in question, but no one would ever use it for a system library; less than half the bits are good.
Also worth noting that it's possible to go faster without further loss of accuracy--the approximation uses a correctly rounded square root, which is much more accurate than the rest of the approximation deserves. An approximate square root will deliver the same overall accuracy and much better vectorized performance.
And mostly it has been OK, except for some cases like games or simulations that want to get bitwise identical results across HW, which (if they're lucky) just don't use these operations or (if they're unlucky) use them and have to handle mismatches somehow. Compilers never generate these operations implicitly unless you're compiling with some sort of fast-math flag, so you mostly only get to them by explicitly using an intrinsic, and in theory you know what you're signing up for if you do that.
However, this did make them unusable for some scenarios where you would otherwise like to use them, so a bunch of graphics and scientific computing and math library developers said "please fully specify these operations next time" and now NEON/SVE and AVX512 have fully-specified reciprocal estimates,¹ which solves the problem unless you have to interoperate between x86 and ARM.
¹ e.g. Intel "specifies" theirs here: https://www.intel.com/content/www/us/en/developer/articles/c...
ARM's is a little more readable: https://developer.arm.com/documentation/ddi0596/2021-03/Shar...
We didn't carry that algorithm forward to arm64, sadly, because Apple's architects made fsqrt fast enough that it wasn't worth it in scalar contexts.