By interpolating between values you can get excellent results with LUTs much smaller than 32KB. Whether it would be faster than the computation from the OP, I don't know.
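As a rough sketch of what an interpolated LUT looks like, here is a hypothetical 4096-entry table for asin over [-1, 1] with linear interpolation between adjacent entries (the table size and layout are my assumptions, not the article's code):

```python
import math

# Assumed sketch: 4096-entry LUT for asin on [-1, 1],
# with one extra knot so index i+1 is always valid.
N = 4096
TABLE = [math.asin(-1.0 + 2.0 * i / N) for i in range(N + 1)]

def asin_lut(x):
    # Map x from [-1, 1] into table-index space [0, N].
    t = (x + 1.0) * 0.5 * N
    i = min(int(t), N - 1)       # clamp so x == 1.0 stays in range
    frac = t - i                 # fractional position within the segment
    # Linear interpolation between the two neighboring knots.
    return TABLE[i] * (1.0 - frac) + TABLE[i + 1] * frac
```

Away from the endpoints (where asin's derivative blows up), the interpolated result is within a few units in the last few digits of `math.asin` at this table size.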
reply
I experimented a bit with the code, trying various tables with different data types. There is enough noise from the Monte Carlo that using data types smaller than double or float makes no difference. Even dropping interpolation worked fine, and that got the speed on par with the best in the article, but not faster.
reply
Does your benchmark use sequential or randomly ordered inputs? That would make a substantial difference with a LUT, I would think. But I'm guessing; maybe 32K is small enough that it doesn't matter (if almost all of the LUT sits in cache and is never evicted).
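A hypothetical harness for this check: run the same interpolated LUT over a sorted input sweep and over a shuffled copy of it, and compare timings. The table size, input count, and LUT code here are my assumptions for illustration, not the article's benchmark (and in a cache-sensitive test you would want this in C, not Python):

```python
import math
import random
import time

# Assumed LUT, same shape as discussed above: 4096 entries over [-1, 1].
N = 4096
TABLE = [math.asin(-1.0 + 2.0 * i / N) for i in range(N + 1)]

def asin_lut(x):
    t = (x + 1.0) * 0.5 * N
    i = min(int(t), N - 1)
    frac = t - i
    return TABLE[i] * (1.0 - frac) + TABLE[i + 1] * frac

# Sequential sweep over [-1, 1] vs. the same values in random order.
inputs = [-1.0 + 2.0 * k / 100_000 for k in range(100_001)]
shuffled = inputs[:]
random.shuffle(shuffled)

def run(xs):
    t0 = time.perf_counter()
    s = sum(asin_lut(x) for x in xs)   # checksum so the loop isn't dead code
    return s, time.perf_counter() - t0

seq_sum, seq_time = run(inputs)
rnd_sum, rnd_time = run(shuffled)
```

Both runs compute the same set of lookups, so the checksums should agree (up to floating-point summation order); any timing gap would then be attributable to access pattern.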

> if you use smaller data types than double or float. Even dropping interpolation worked fine,

That's kinda tautological, isn't it? Of course reduced precision is acceptable where reduced precision is acceptable... I guess I'm assuming double precision was used for a good reason; it often isn't :)

reply
I didn't inspect the rest of the code, but I guess the table is fetched from L2 on every call? I think the L1 data cache is constantly flooded by everything else going on.

About dropping the interpolation: yes, you are right of course. I was thinking about speed: there was no noticeable speed improvement from dropping interpolation, since the asin calls are only a small fraction of the total work.

reply
I'm very skeptical that you wouldn't get perceptible visual artifacts if you rounded the trig functions to 4096 linear segments. But I'd be happy to be proven wrong :)
reply