Faster asin() was hiding in plain sight

(16bpp.net)

154 points

by def-pri-pub 5 hours ago

by jason_s 4 hours ago

While I'm glad to see the OP got a good minimax solution at the end, it seems like the article missed clarifying one of the key points: error waveforms over a specified interval are critical, and if you don't see the characteristic minimax-like wiggle, you're wasting an easy opportunity for improvement.

Taylor series in general are a poor choice, and Pade approximants of Taylor series are equally poor. If you're going to use Pade approximants, they should be of the original function.

I prefer Chebyshev approximation: https://www.embeddedrelated.com/showarticle/152.php which is often close enough to the more complicated Remez algorithm.

by ogogmad 49 minutes ago

Chebyshev polynomials cos(n arccos(x)) provide one of the proofs that every continuous function f:[0,1]->R can be uniformly approximated by polynomial functions. Bernstein polynomials provide a shorter proof, but perhaps not the best numerical method: https://en.wikipedia.org/wiki/Bernstein_polynomial#See_also

by andyjohnson0 6 minutes ago

Interesting article. A few years back I implemented a bunch of maths primitives, including trig functions, using Taylor series etc, to see how it was done. An interesting challenge, even at the elementary level I was working at.

So this article got me wondering how much accuracy is needed before computing a series beats pre-computed lookup tables and interpolation. Anyone got any relevant experience to share?

How much accuracy does ray tracing require?

by xt00 3 hours ago

To be accurate, this is originally from Hastings, 1955 (Princeton), "Approximations for Digital Computers" by Cecil Hastings, pages 159-163; there are actually multiple versions of the approximation with different constants. So the original work was done with the goal of being performant on the computers of the 1950s. Then the famous Abramowitz and Stegun put it in as formula 4.4.45 with permission, and then the Nvidia Cg library wrote some code based on that formula, likely with some optimizations.

by LegionMammal978 4 hours ago

In general, I find that minimax approximation is an underappreciated tool, especially the quite simple Remez algorithm to generate an optimal polynomial approximation [0]. With some modifications, you can adapt it to optimize for either absolute or relative error within an interval, or even come up with rational-function approximations. (Though unfortunately, many presentations of the algorithm use overly-simple forms of sample point selection that can break down on nontrivial input curves, especially if they contain small oscillations.)

[0] https://en.wikipedia.org/wiki/Remez_algorithm

by herf 3 hours ago

They teach a lot of Taylor/Maclaurin series in Math classes (and trig functions are sometimes called "CORDIC" which is an old method too) but these are not used much in actual FPUs and libraries. Maybe we should update the curricula so people know better ways.

by bee_rider 2 hours ago

Taylor series makes a lot more sense in a math class, right? It is straightforward and (just for example), when you are thinking about whether or not a series converges in the limit, why care about the quality of the approximation after a set number of steps?

by jason_s 4 hours ago

Not sure I would call Remez "simple"... it's all relative; I prefer Chebyshev approximation which is simpler than Remez.

by LegionMammal978 3 hours ago

Perhaps, but at least I find it very simple for the optimality properties it gives: there is no inherent need to say, "I know that better parameters likely exist, but the algorithm to find them would be hopelessly expensive," as is the case in many forms of minimax optimization.

by stephencanon 3 hours ago

Ideally either one is just a library call to generate the coefficients. Remez can get into trouble near the endpoints of the interval for asin and require a little bit of manual intervention, however.

by srean 15 minutes ago

Not directly related, but if you are reaching for sin, cos, asin, acos, atan and their relatives to handle rotation, you may save yourself a lot of trouble by representing an angle not as a scalar but as (i) a (cos, sin) tuple or (ii) equivalently as a complex number.

You may then get away with simple algebra and square roots. A runtime such as Python would do a lot of that transparently.

by exmadscientist 4 hours ago

This line:

> This amazing snippet of code was languishing in the docs of dead software, which in turn the original formula was scrawled away in a math textbook from the 60s.

was kind of telling for me. I have some background in this sort of work (and long ago concluded that there was pretty much nothing you can do to improve on existing code, unless either you have some new specific hardware or domain constraint, or you're just looking for something quick-n-dirty for whatever reason, or are willing to invest research-paper levels of time and effort) and to think that someone would call Abramowitz and Stegun "a math textbook from the 60s" is kind of funny. It's got a similar level of importance to its field as Knuth's Art of Computer Programming or stuff like that. It's not an obscure text. Yeah, you might forget what all is in it if you don't use it often, but you'd go "oh, of course that would be in there, wouldn't it...."

by wolfi1 3 hours ago

Abramowitz/Stegun was updated in 2010 and now resides here: https://dlmf.nist.gov/

by def-pri-pub 2 hours ago

These are books that my uni courses never had me read. I'm a little shocked at times at how my degree program skimped on some of the more famous texts.

by neutronicus 1 hour ago

It is not a textbook, it is an extremely dense reference manual, so that honestly makes sense.

In physics grad school, professors would occasionally allude to it, and textbooks would cite it ... pretty often. So it's a thing anyone with postgraduate physics education should know exists, but you wouldn't ever be assigned it.

by cmovq 2 hours ago

> After all of the above work and that talk in mind, I decided to ask an LLM.

Impressive that an LLM managed to produce the answer from a 7 year old stack overflow answer all on its own! [1] This would have been the first search result for “fast asin” before this article was published.

[1]: https://stackoverflow.com/a/26030435

by def-pri-pub 1 hour ago

I did see that, but isn't the vast majority of that page talking about acos() instead?

by AlotOfReading 4 hours ago

I'm pretty sure it's not faster, but it was fun to write:

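    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    float asin_bithack(float x) {
      const float half_pi = 1.57079632679f;
      float x2 = 1.0f - fabsf(x);
      uint32_t i;
      memcpy(&i, &x2, sizeof i);     // bitcast float -> u32
      i = 0x5f3759df - (i >> 1);     // rough 1/sqrt(x2) via the magic constant
      float inv;
      memcpy(&inv, &i, sizeof inv);  // bitcast u32 -> float
      return copysignf(half_pi - half_pi * (x2 * inv), x);
    }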

Courtesy of evil floating point bithacks.

by jacquesm 2 hours ago

That could do with some subtitles.

by irishcoffee 2 hours ago

https://en.wikipedia.org/wiki/Fast_inverse_square_root

by def-pri-pub 4 hours ago

> floating point bithacks

The forbidden magic

by chuckadams 4 hours ago

You brought Zalgo. I blame this decade on you.

by adampunk 3 hours ago

// what the fuck

by moffkalast 4 hours ago

> float asinine(float x) {

FTFY :P

by scottlamb 4 hours ago

Isn't the faster approach SIMD [edit: or GPU]? A 1.05x to 1.90x speedup is great. A 16x speedup is better!

They could be orthogonal improvements, but if I were prioritizing, I'd go for SIMD first.

I searched for asin on Intel's intrinsics guide. They have an AVX-512 intrinsic `_mm512_asin_ps` but it says "sequence" rather than single-instruction. Presumably the actual sequence they use is in some header file somewhere, but I don't know off-hand where to look, so I don't know how it compares to a SIMDified version of `fast_asin_cg`.

https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

by TimorousBestie 4 hours ago

I don’t know much about raytracing but it’s probably tricky to orchestrate all those asin calls so that the input and output memory is aligned and contiguous. My uneducated intuition is that there’s little regularity as to which pixels will take which branches and will end up requiring which asin calls, but I might be wrong.

by scottlamb 4 hours ago

I'd expect it to come down to data-oriented design: SoA (structure of arrays) rather than AoS (array of structures).

I skimmed the author's source code, and this is where I'd start: https://github.com/define-private-public/PSRayTracing/blob/8...

Instead of an `_objects`, I might try for a `_spheres`, `_boxes`, etc. (Or just `_lists` still using the virtual dispatch but for each list, rather than each object.) The `asin` seems to be used just for spheres. Within my `Spheres::closest_hit` (note plural), I'd work to SIMDify it. (I'd try to SIMDify the others too of course but apparently not with `asin`.) I think it's doable: https://github.com/define-private-public/PSRayTracing/blob/8...

I don't know much about ray tracers either (having only written a super-naive one back in college) but this is the general technique used to speed up games, I believe. Besides enabling SIMD, it's more cache-efficient and minimizes dispatch overhead.

edit: there's also stuff that you can hoist in this impl. Restructuring as SoA isn't strictly necessary to do that, but it might make it more obvious and natural. As an example, this `ray_dir.length_squared()` is the same for the whole list. You'd notice that when iterating over the spheres. https://github.com/define-private-public/PSRayTracing/blob/8...

by def-pri-pub 1 hour ago

When I was working on this project, I was trying to restrict myself to the architecture of the original Ray Tracing in One Weekend book series. I am aware that things are not as SIMD friendly and that this becomes a major bottleneck. While I am confident that an architectural change could yield a massive performance boost, it's something I don't want to spend my time on.

I think it's also more fun sometimes to take existing systems and to try to optimize them given whatever constraints exist. I've had to do that a lot in my day job already.

by TimorousBestie 4 hours ago

This tracks with my experience and seems reasonable, yes. I tend to SoA all the things, sometimes to my coworkers’ amusement/annoyance.

by Am4TIfIsER0ppos 4 hours ago

I don't do much float work, but I don't think there is a single regular sine instruction, only the old x87 float stack ones.

I was curious what "sequence" would end up being but my compiler is too old for that intrinsic. Even godbolt didn't help for gcc or clang but it did reveal that icc produced a call https://godbolt.org/z/a3EsKK4aY

by nitwit005 1 hour ago

If you click libraries on godbolt, it's pulling in a bunch, including multiple SIMD libraries. You might have to fiddle with the libraries or build locally.

by orangepanda 5 hours ago

> Nobody likes throwing away work they've done

I like throwing away work I've done. Frees up my mental capacity for other work to throw away.

by debo_ 27 minutes ago

https://bash-org-archive.com/?427792

by sixo 3 hours ago

It appears that the real lesson here was to lean quite a bit more on theory than a programmer's usual roll-your-own heuristic would suggest.

A fantastic amount of collective human thought has been dedicated to function approximations in the last century; Taylor methods are over 200 years old and unlikely to come close to state-of-the-art.

by glitchc 4 hours ago

The 4% improvement doesn't seem like it's worth the effort.

On a general note, instructions like division and square root are roughly equal to trig functions in cycle count on modern CPUs. So, replacing one with the other will not confer much benefit, as evidenced from the results. They're all typically implemented using LUTs, and it's hard to beat the performance of an optimized LUT, which is basically a multiplexer connected to some dedicated memory cells in hardware.

by def-pri-pub 1 hour ago

You'd be surprised how it actually is worth the effort, even just a 1% improvement. If you have the time, this is a great talk to listen to: https://www.youtube.com/watch?v=kPR8h4-qZdk

For a little toy ray tracer, it is pretty measly. But for a larger corporation (with a professional project) a 4% speed improvement can mean MASSIVE cost savings.

Some of these tiny improvements can also have a cascading effect. Imagine finding a +4%, a +2% somewhere else, +3% in neighboring code, and a bunch of +1%s here and there. Eventually you'll have built up something that is 15-20% faster. Down the road you'll come across those optimizations which can yield the big results too (e.g. the +25%).

by glitchc 41 minutes ago

It's a cool talk, but the relevance to the present problem escapes me.

If you're alluding to gcc vs fbstring's performance (circa 15:43), then the performance improvement is not because fbstring is faster/simpler, but due to a foundational gcc design decision to always use the heap for string variables. Also, at around 16:40, the speaker concedes that gcc's simpler size() implementation runs significantly faster (3x faster at 0.3 ns) when the test conditions are different.

by kstrauser 4 hours ago

> The 4% improvement doesn't seem like it's worth the effort.

People have gotten PhDs for smaller optimizations. I know. I've worked with them.

> instructions like division and square root are roughly equal to trig functions in cycle count on modern CPUs.

What's the x86-64 opcode for arcsin?

by adrian_b 1 hour ago

Presumably the poster meant polynomial approximations of trigonometric functions not instructions for trigonometric functions, which are missing in most CPUs, though many GPUs have such instructions.

x86-64 had instructions for the exponential and logarithmic functions in Xeon Phi, but those instructions have been removed in Skylake Server and the later Intel or AMD CPUs with AVX-512 support.

However, instructions for trigonometric functions have no longer been added after Intel 80387, and those of 8087 and 80387 are deprecated.

by glitchc 59 minutes ago

> What's the x86-64 opcode for arcsin?

Not required. ATAN and SQRTS(S|D) are sufficient, the half-angle approach in the article is the recommended way.

> People have gotten PhDs for smaller optimizations. I know. I've worked with them.

I understand the can, not sure about the should. Not trying to be snarky, we just seem to be producing PhDs with the slimmest of justifications. The bar needs to be higher.

by kstrauser 4 minutes ago

> I understand the can, not sure about the should. Not trying to be snarky, we just seem to be producing PhDs with the slimmest of justifications. The bar needs to be higher.

I couldn't disagree more. Sure, making a 4% faster asin isn't going to change the world, but if it makes all callers a teensy bit faster, multiplied by the number of callers using it, then it adds up. Imagine the savings for a hyperscaler if they managed to make a more common instruction 4% faster.

by tverbeure 2 hours ago

> The 4% improvement doesn't seem like it's worth the effort.

I've spent the past few months improving the performance of some work thing by ~8% and the fun I've been having reminds me of the nineties, when I tried to squeeze every last % of performance out of the 3D graphics engine that I wrote as a hobby.

by charcircuit 3 hours ago

The effort of typing about 10 words into a LLM is minimal.

by empiricus 3 hours ago

Does anyone know of resources on the algos used in the HW implementations of math functions? I mean the algos inside the CPUs and GPUs: how they trade off transistor count, power consumption, and cycles, and which algos allow this.

by erichocean 5 hours ago

Ideal HN content, thanks!

by ok123456 3 hours ago

Chebyshev approximation for asin is sum(4T_n(x) / (pi*n*n),n) over odd n; the even terms are 0.
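A quick way to sanity-check the series numerically. Substituting x = cos(t) turns asin(x) into pi/2 - t, whose cosine series has coefficients 4/(pi*n*n) for odd n and 0 for even n; at x = 1 that gives (4/pi)(pi^2/8) = pi/2 as expected. The function name below is made up for this sketch.

```c
#include <math.h>

/* Partial sum of asin(x) = (4/pi) * sum over odd n of T_n(x) / n^2, |x| <= 1.
   T_n(x) is generated by the recurrence T_{n+1} = 2x*T_n - T_{n-1}. */
double asin_cheb_series(double x, int max_n) {
    const double pi = 3.14159265358979323846;
    double t_prev = 1.0;  /* T_0 */
    double t_cur = x;     /* T_1 */
    double sum = 0.0;
    for (int n = 1; n <= max_n; n++) {
        if (n % 2 == 1)
            sum += t_cur / ((double)n * n);
        double t_next = 2.0 * x * t_cur - t_prev;
        t_prev = t_cur;
        t_cur = t_next;
    }
    return 4.0 / pi * sum;
}
```

The terms decay only like 1/n^2, so a truncation over the full interval [-1, 1] converges slowly; that is one reason practical implementations factor out the square root first.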

by drsopp 4 hours ago

Did some quick calculations, and at this precision, it seems a table lookup might be able to fit in the L1 cache depending on the CPU model.

by Pannoniae 4 hours ago

Microbenchmarks. A LUT will win many of them but you pessimise the rest of the code. So unless a significant (read: 20+%) portion of your code goes into the LUT, there isn't that much point to bother. For almost any pure calculation without I/O, it's better to do the arithmetic than to do memory access.

by jcalvinowens 4 hours ago

Locality within the LUT matters too: if you know you're looking up identical or nearby-enough values to benefit from caching, an LUT can be more of a win. You only pay the cache cost for the portion you actually touch at runtime.

I could imagine some graphics workloads tend to compute asin() repeatedly with nearby input values. But I'd guess the locality isn't local enough to matter: only eight double precision floats fit in a cache line.

by groundzeros2015 4 hours ago

I don’t want to fill up L1 for sin.

by jcalvinowens 4 hours ago

Surely the loss in precision of a 32KB LUT for double precision asin() would be unacceptable?

by Jyaif 4 hours ago

By interpolating between values you can get excellent results with LUTs much smaller than 32KB. Whether it will be faster than the computation from OP, I don't know.

by drsopp 3 hours ago

I experimented a bit with the code. Various tables with different datatypes. There is enough noise from the Monte Carlo to not make a difference if you use smaller data types than double or float. Even dropping interpolation worked fine, and got the speed to be on par with the best in the article, but not faster.

by jcalvinowens 3 hours ago

Does your benchmark use sequential or randomly ordered inputs? That would make a substantial difference with an LUT, I would think. But I'm guessing. Maybe 32K is so small it doesn't matter (if almost all of the LUT sits in the cache and is never displaced).

> if you use smaller data types than double or float. Even dropping interpolation worked fine,

That's kinda tautological isn't it? Of course reduced precision is acceptable where reduced precision is acceptable... I guess I'm assuming double precision was used for a good reason, it often isn't :)

by drsopp 2 hours ago

I didn't inspect the rest of the code but I guess the table is fetched from L2 on every call? I think the L1 data cache is flooded by other stuff going on all the time.

About dropping the interpolation: Yes you are right of course. I was thinking about the speed. No noticeable speed improvement by dropping interpolation. The asin calls are only a small fraction of everything.

by jcalvinowens 3 hours ago

I'm very skeptical you wouldn't get perceptible visual artifacts if you rounded the trig functions to 4096 linear approximations. But I'd be happy to be proven wrong :)

by stephc_int13 4 hours ago

My favorite tool to experiment with math approximation is lolremez. And you can easily ask your LLM to do it for you.

by varispeed 1 hour ago

If you are interested in such "tricks", you should check out the classic Hacker's Delight by Henry Warren

by adampunk 5 hours ago

We love to leave faster functions languishing in library code. The basis for Q3A’s fast inverse square root had been sitting in fdlibm since 1986, on the net since 1993: https://www.netlib.org/fdlibm/e_sqrt.c

by def-pri-pub 1 hour ago

Funnily enough, that fdlibm implementation of asin() did come up in my research. I believe it might have been more performant in the past. But from a quick scan of `e_asin.c`, I see it doing something similar to the Cg asin() implementation, but with more terms and more multiplications, so my guess is that it's slower. I think I also see it taking more branches, which could lead to more of a slowdown.

by patchnull 2 hours ago

[flagged]

by Sesse__ 2 hours ago

And similarly, entire generations of programmers were never taught Horner's scheme. You can see it in the article, where they write stuff like

  A * x * x * x * x * x * x + B * x * x * x * x + C * x * x + D

(10 muls, 3 muladds)

instead of the faster

  tmp = x * x;
  ((A * tmp + B) * tmp + C) * tmp + D

(1 mul, 3 muladds)
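Written out as compilable C, with the two forms side by side. The function names are made up here and A through D are placeholder parameters, not the article's coefficients.

```c
/* Naive form: every power of x is rebuilt from scratch, left to right. */
double poly_naive(double x, double A, double B, double C, double D) {
    return A * x * x * x * x * x * x + B * x * x * x * x + C * x * x + D;
}

/* Horner's scheme in x^2: one multiply for x2, then three multiply-adds. */
double poly_horner(double x, double A, double B, double C, double D) {
    double x2 = x * x;
    return ((A * x2 + B) * x2 + C) * x2 + D;
}
```

The two agree exactly on small integer inputs (x = 2 with coefficients 1, 2, 3, 4 gives 112 either way) but can differ in the last bits for general floating-point inputs, which is why a compiler will not turn one into the other by default.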

by anematode 34 minutes ago

Yep, good stuff. Another nice trick to extract more ILP is to split it into even/odd exponents and then recombine at the end (not sure if this has a name). This also makes it amenable to SLP vectorization (although I doubt the compiler will do this nicely on its own). For example something like

    typedef double v2d __attribute__ ((vector_size (16)));

    v2d packed = { x, x };
    packed = fma(packed, As, Bs);
    packed = fma(packed, Cs, Ds);
    // ...
    return x * packed[0] + packed[1];

smth like that

Actually one project I was thinking of doing was creating SLP vectorized versions of libm functions. Since plenty of programs spend a lot of time in libm calling single inputs, but the implementation is usually a bunch of scalar instructions.
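A scalar C sketch of the even/odd split for a degree-7 polynomial, without the vector types. The function name and the coefficient layout `c[0]..c[7]` (lowest degree first) are assumptions for illustration.

```c
/* Evaluate c[0] + c[1]*x + ... + c[7]*x^7 as E(x^2) + x * O(x^2).
   The two Horner chains do not depend on each other, so an out-of-order
   core (or two SIMD lanes) can work on them at the same time. */
double poly7_even_odd(const double c[8], double x) {
    double y = x * x;
    double even = ((c[6] * y + c[4]) * y + c[2]) * y + c[0];
    double odd  = ((c[7] * y + c[5]) * y + c[3]) * y + c[1];
    return x * odd + even;
}
```

Packing `even` and `odd` into the two lanes of a 128-bit vector gives the SLP-style version sketched above.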

by eska 2 hours ago

The problem with Horner’s scheme is that it creates a long chain of data dependencies, instead of making full use of all execution units. Usually you’d want more of a binary tree than a chain.

by cmovq 1 hour ago

Not in this case because the dependencies are the same:

Naive: https://godbolt.org/z/Gzf1KM9Tc

Horner's: https://godbolt.org/z/jhvGqcxj1

by Sesse__ 2 hours ago

Still, it's no worse than the naïve formula, which has exactly the same data dependencies and then more.

_Can_ you even make a reasonable high-ILP scheme for a polynomial, unless it's of extremely high degree?

by stephencanon 17 minutes ago

For throughput-dominated contexts, evaluation via Horner's rule does very well because it minimizes register pressure and the number of operations required. But the latency can be relatively high, as you note.

There are a few good general options to extract more ILP for latency-dominated contexts, though all of them trade additional register pressure and usually some additional operation count; Estrin's scheme is the most commonly used. Factoring medium-order polynomials into quadratics is sometimes a good option (not all such factorizations are well behaved wrt numerical stability, but it also can give you the ability to synthesize selected extra-precise coefficients naturally without doing head-tail arithmetic). Quadratic factorizations are a favorite of mine because (when they work) they yield good performance in _both_ latency- and throughput-dominated contexts, which makes it easier to deliver identical results for scalar and vectorized functions.

There's no general form "best" option for optimizing latency; when I wrote math library functions day-to-day we just built a table of the optimal evaluation sequence for each order of polynomial up to 8 or so and each microarchitecture and grabbed the one we needed unless there were special constraints that required a different choice.
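For readers who have not seen Estrin's scheme, a sketch for a degree-7 polynomial in C. The function name and the coefficient layout (lowest degree first) are assumptions; this is the textbook form, not any particular library's tuned sequence.

```c
/* Estrin's scheme: pair up neighboring coefficients, then combine the pairs
   with x^2 and x^4. The dependency chain is about three multiply-adds deep
   instead of the seven needed by Horner, at the cost of extra multiplies
   and more live registers. */
double poly7_estrin(const double c[8], double x) {
    double x2 = x * x;
    double x4 = x2 * x2;
    double p01 = c[0] + c[1] * x;
    double p23 = c[2] + c[3] * x;
    double p45 = c[4] + c[5] * x;
    double p67 = c[6] + c[7] * x;
    double lo = p01 + p23 * x2;   /* c0 + c1 x + c2 x^2 + c3 x^3 */
    double hi = p45 + p67 * x2;   /* c4 + c5 x + c6 x^2 + c7 x^3 */
    return lo + hi * x4;
}
```

The four pair computations are mutually independent, which is where the extra instruction-level parallelism comes from.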

by woadwarrior01 2 hours ago

The common subexpression elimination (CSE) pass in compilers takes care of that.

by cmovq 2 hours ago

Compilers cannot do this optimization for floating point [1] unless you're compiling with -ffast-math. In general, don't rely on compilers to optimize floating point sub-expressions.

[1]: https://godbolt.org/z/8bEjE9Wxx

by woadwarrior01 1 hour ago

Right, I totally forgot about floating point non associativity.

by def-pri-pub 1 hour ago

The reason for writing out all of the x multiplications like that is that I was hoping the compiler would detect such a pattern and perform an optimization for me. Matt Godbolt's "Advent of Compiler Optimizations" series mentions some of these cases where the compiler can do more auto-optimizations for the developer.

by pavpanchekha 1 hour ago

Horner's form is typically also more accurate, or at least, it is not bit-identical, so the compiler won't do it unless you pass -funsafe-math-optimizations, and maybe not even then.

by owlbite 1 hour ago

Not just for speed, Horner can also be essential for numerical stability.

by arkmm 2 hours ago

Didn't know this technique had a name, but I would think a modern compiler could make this optimization on its own, no?

by Sesse__ 2 hours ago

No, it's not equivalent for floating point, so a compiler won't do it unless you do -fassociative-math (or a superset, such as -ffast-math), in which case all correctness bets are off.

by 3371 2 hours ago

Isn't that for... readability...?

by zahlman 2 hours ago

Is this outside of what compilers can do nowadays? (Or do they refuse because it's floating-point?)

by boothby 2 hours ago

Thinking about speed like this used to be necessary in C and C++ but these days you should feel free to write the most legible thing (Horner's form) and let the compiler find the optimal code for it (probably similar to Horner's form but broken up to have a shallower dependency chain).

But if you're writing in an interpreted language that doesn't have a good JIT, or for a platform with a custom compiler, it might be worth hand-tweaking expressions with an eye towards performance and precision.

by exmadscientist 2 hours ago

You should never assume the compiler is allowed to reorder floating-point computations like it does with integers. Integer math is exact, within its domain. Floating-point math is not. The IEEE-754 standard knows this, and the compiler knows this.
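A tiny C illustration of why regrouping is off limits: the same three floats summed with two different groupings. The function names are made up for this example.

```c
/* Sum three floats as (a + b) + c. */
float sum_left(float a, float b, float c) {
    return (a + b) + c;
}

/* Sum the same three floats as a + (b + c). */
float sum_right(float a, float b, float c) {
    return a + (b + c);
}
```

With a = 1e20f, b = -1e20f, c = 1.0f the first grouping returns 1 and the second returns 0, because 1.0f is far below the rounding granularity of 1e20f and vanishes when added to it first. A compiler that regrouped freely would silently change such results, unless a flag like -ffast-math says that is acceptable.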

by boothby 2 hours ago

Ah, fair point, it has been a while since I've needed fast inexact math.

Though... they are allowed to cache common subexpressions, and my point about dependency chains is quite relevant on modern hardware. So x*x, x*x*x, etc may each be computed once. And since arithmetic operators are left-to-right associative, the rather ugly code, as written, is fast and not as wasteful as it appears.

by Sesse__ 1 hour ago

> And since arithmetic operators are left-to-right associative, the rather ugly code, as written, is fast and not as wasteful as it appears.

This is incorrect, for exactly the reason you are citing: A * x * x * x * x = (((A * x) * x) * x) * x), which means that (x * x) is nowhere to be seen in the expression and cannot be factored out. Now, if you wrote x * x * x * x * A instead, _then_ the compiler could have done partial CSE against the term with B, although still not as much as you'd like.

by eska 2 hours ago

The compiler is often not allowed to rearrange such operations due to a change in intermediate results. So one would have to activate something like fastmath for this code, but that’s probably not desired for all code, so one has to introduce a small library, and so on. Debug builds may be using different compilation flags, and suddenly performance can become terrible while debugging. Performance can also tank because a new compiler version optimizes differently, etc. So in general I don’t think this advice is true.

by scottlamb 1 hour ago

Probably for ints unconditionally. For floats in Sesse__'s example without `-ffast-math`, I count 10 muls, 2 muladds, 1 add. With `-ffast-math`, 1 mul, 3 muladds. <https://godbolt.org/z/dPrbfjzEx>

by david-gpu 2 hours ago

Yeah, I once worked at a place where the compiler team was assigned the unpleasant task of implementing a long list of trigonometry functions. They struggled for many months to get the accuracy that was required of them, and when they did the performance was abysmal compared to the competition.

In hindsight, they probably didn't have anybody with the right background and should have contracted out the job. I certainly didn't have the necessary knowledge, either.

by patchnull 5 hours ago

[flagged]

by stephencanon 4 hours ago

These sorts of approximations (and more sophisticated methods) are fairly widely used in systems programming, as seen by the fact that Apple's asin is only a couple percent slower and sub-ulp accurate (https://members.loria.fr/PZimmermann/papers/accuracy.pdf). I would expect to get similar performance on non-Apple x86 using Intel's math library, which does not seem to have been measured, and significantly better performance while preserving accuracy using a vectorized library call.

The approximation reported here is slightly faster but only accurate to about 2.7e11 ulp. That's totally appropriate for the graphics use in question, but no one would ever use it for a system library; less than half the bits are good.

Also worth noting that it's possible to go faster without further loss of accuracy--the approximation uses a correctly rounded square root, which is much more accurate than the rest of the approximation deserves. An approximate square root will deliver the same overall accuracy and much better vectorized performance.

by Pannoniae 4 hours ago

Yeah, the only big problem with approx. sqrt is that it's not consistent across systems, for example Intel and AMD implement RSQRT differently... Fine for graphics, but if you need consistency, that messes things up.

by stephencanon 4 hours ago

Newer rsqrt approximations (ARM NEON and SVE, and the AVX512F approximations on x86) make the behavior architectural so this is somewhat less of a problem (it still varies between _architectures_, however).

by def-pri-pub 2 hours ago

Wait, what? Do you have a resource I could read up on about that? That is moderately concerning if your math isn't portable across chips.

by stephencanon 2 hours ago

When Intel specced the rsqrt[ps]s and rcp[ps]s instructions ~30 years ago, they didn't fully specify their behavior. They just said their relative error is "smaller than 1.5 * 2⁻¹²," which someone thought was very clever because it gave them leeway to use tables or piecewise linear approximations or digit-by-digit computation or whatever was best suited to future processors. Since these are not IEEE 754 correctly-rounded operations, and there was (by definition) no software that currently used them, this was "fine".

And mostly it has been OK, except for some cases like games or simulations that want to get bitwise identical results across HW, which (if they're lucky) just don't use these operations or (if they're unlucky) use them and have to handle mismatches somehow. Compilers never generate these operations implicitly unless you're compiling with some sort of fast-math flag, so you mostly only get to them by explicitly using an intrinsic, and in theory you know what you're signing up for if you do that.

However, this did make them unusable for some scenarios where you would otherwise like to use them, so a bunch of graphics and scientific computing and math library developers said "please fully specify these operations next time" and now NEON/SVE and AVX512 have fully-specified reciprocal estimates,¹ which solves the problem unless you have to interoperate between x86 and ARM.

¹ e.g. Intel "specifies" theirs here: https://www.intel.com/content/www/us/en/developer/articles/c...

ARM's is a little more readable: https://developer.arm.com/documentation/ddi0596/2021-03/Shar...

by def-pri-pub 1 hour ago

Thanks!

by patchnull 4 hours ago

[flagged]

by stephencanon 3 hours ago

For the asinf libcall on macOS/x86, my former colleague Eric Postpischil invented the novel (at least at the time, I believe) technique of using a Remez-optimized refinement polynomial following rsqrtss instead of the standard Newton-Raphson iteration coefficients, which allowed him to squeeze out just enough extra precision to make the function achieve sub-ulp accuracy. One of my favorite tricks.

We didn't carry that algorithm forward to arm64, sadly, because Apple's architects made fsqrt fast enough that it wasn't worth it in scalar contexts.
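For context, a C sketch of the standard pattern being improved on: a crude reciprocal-square-root estimate followed by a Newton-Raphson step. The Remez-optimized coefficients described above are not given in this thread, so the step below uses the plain textbook constants 1.5 and 0.5, and the portable bit trick stands in for a hardware estimate such as rsqrtss. Function names are made up.

```c
#include <stdint.h>
#include <string.h>

/* Crude 1/sqrt(x) estimate for positive, normal x (the well-known bit trick).
   Used here only as a portable stand-in for a hardware estimate instruction. */
float rsqrt_estimate(float x) {
    uint32_t i;
    memcpy(&i, &x, sizeof i);
    i = 0x5f3759dfu - (i >> 1);
    float y;
    memcpy(&y, &i, sizeof y);
    return y;
}

/* One Newton-Raphson step for f(y) = 1/y^2 - x:
   y <- y * (1.5 - 0.5 * x * y * y). Roughly squares the relative error. */
float rsqrt_refine(float x, float y) {
    return y * (1.5f - 0.5f * x * y * y);
}
```

Replacing 1.5 and 0.5 with constants tuned to the error profile of the specific initial estimate is, as far as I understand the comment above, the essence of the trick; the tuned values themselves are not reproduced here.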

by def-pri-pub 1 hour ago

I did scan some (major) open source games and graphics-related projects and found a few of them using `std::asin()`. I plan on submitting some patches.
