Even the latest CPUs have a 2:1 fp64:fp32 performance ratio - and because fp64 also takes 2x the space in cache and 2x the memory bandwidth, you can often get greater than a 2x difference.

If you're in a numerically heavy use case, that's a massive difference. It's not some outdated "Ancient Lore" that causes languages that care about performance to default to fp32 :P

by pixelesque 5 hours ago

> Even the latest CPUs have a 2:1 fp64:fp32 performance ratio

Not completely. For basic operations (ignoring byte size for things like cache hit ratios and memory bandwidth), if you look at (say Agner Fog's optimisation PDFs of instruction latency) the SSE/AVX numbers for add/sub/mult/div (yes, even divides these days), the latency between float and double is almost always the same on the most recent AMD/Intel CPUs (and normally execution ports can do both now).

Where it differs is gather/scatter and some shuffle instructions (larger size to work on), and maths routines like sqrt() and transcendentals such as sin(), where the backing algorithms (whether on the processor in some cases, or in libm or equivalent) have to do more work (often more iterations of refinement) to calculate the value to greater precision for f64.

by omoikane 2 hours ago

> the latency between float and double is almost always the same on the most recent AMD/Intel CPUs

If you are developing for ARM, some systems have hardware support for FP32 but use software emulation for FP64, with a noticeable performance difference.

https://gcc.godbolt.org/z/7155YKTrK
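
A minimal sketch of the kind of comparison this describes (the function names are made up for illustration, and this is not necessarily what the linked godbolt contains). Assuming a 32-bit ARM target whose FPU is single-precision only, for example GCC with -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard, the float version should compile to a single vmul.f32, while the double version should become a call to the runtime helper __aeabi_dmul:

```c
/* Hypothetical helpers: the same multiply at two precisions.
   Compare the generated assembly for a single-precision-only ARM FPU. */

/* Expected to map to one hardware instruction (vmul.f32) on such a target. */
float scale_f32(float x, float k) { return x * k; }

/* Expected to call a software routine (__aeabi_dmul) on such a target. */
double scale_f64(double x, double k) { return x * k; }
```

On x86-64 or AArch64 both functions compile to one hardware multiply, so the difference only shows up when cross-compiling for a target like the one assumed above.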

by kimixa 3 hours ago

> ... if you look at (say Agner Fog's optimisation PDFs of instruction latency) ...

That.... doesn't seem true? At least for most architectures I looked at?

While it's true that ADDPS and ADDPD have the same latency, using the zen4 example at least, the double variant only calculates 4 fp64 values compared to the single-precision variant's 8 fp32. Which was my point? If each double-precision instruction processes a smaller number of inputs, it needs to be lower latency to keep the same operation rate.

And DIV also has a significantly lower throughput for fp64 vs fp32 on zen4, 5 clk/op vs 3, while also processing half the values?

Sure, if you're doing scalar fp32/fp64 instructions it's not much of a difference (though DIV still has a lower throughput) - but then you're already leaving significant peak flops on the table, so I'm not sure it's a particularly useful comparison. It's just the truism of "if you're not performance limited you don't need to think about performance" - which has always been the case.

So yes, they do at least have a 2:1 difference in throughput on zen4 - even higher for DIV.
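
To spell out the arithmetic, here is a rough sketch. It assumes 256-bit vectors (which is what 8 fp32 vs 4 fp64 per instruction implies) and reads the DIV figures above as 5 clk/op for fp64 and 3 clk/op for fp32. The numbers are the ones quoted in this comment, not independently measured, and the function names are made up:

```c
/* Elements completed per clock = lanes per instruction / clocks per instruction. */
double elems_per_clock(int vector_bits, int elem_bits, double clocks_per_op) {
    int lanes = vector_bits / elem_bits;   /* 256/32 = 8, 256/64 = 4 */
    return lanes / clocks_per_op;
}

/* fp32:fp64 throughput ratio for packed DIV using the quoted figures:
   (8 lanes / 3 clk) / (4 lanes / 5 clk) = 10/3, roughly 3.3x. */
double div_ratio_fp32_over_fp64(void) {
    double f32 = elems_per_clock(256, 32, 3.0);
    double f64 = elems_per_clock(256, 64, 5.0);
    return f32 / f64;
}
```

With equal clocks per instruction, as in the ADDPS/ADDPD case, the same formula gives exactly 2x, which comes purely from the lane count.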

by adgjlsfhk1 2 hours ago

This depends largely on your operations. There is lots of performance-critical code that doesn't vectorize smoothly, and for those operations, 64-bit is just as fast.

by kimixa 19 minutes ago

Yes, if you're not FP ALU limited (which is likely the case if not vectorized), or data cache/bandwidth/thermally limited from the increased cost of fp64, then it doesn't matter - but as I said that's true for every performance aspect that "doesn't matter".

That doesn't mean that there are no situations where it does matter today - which is what I feel is implied by calling it "Ancient".

by adgjlsfhk1 5 hours ago

> languages that care about performance to default to fp32

What do you mean by this? In C, 1.0 is a double.

by kimixa 3 hours ago

But the "float" typename is generally fp32 - if we assume the "most generically named type" is the "default". Though this is a bit of an inconsistency with C - the type name "double" surely implies it's double the expected baseline while, as you mentioned, constants and much of libm default to 'double'.
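
A small sketch of the literal behaviour being discussed, using C11 _Generic (the macro and function names are made up for illustration). It also shows the usual consequence: multiplying a float by an unsuffixed constant promotes the whole expression to double.

```c
#include <assert.h>

/* 1 if the expression has the given type, 0 otherwise (C11 _Generic). */
#define IS_DOUBLE(x) _Generic((x), double: 1, default: 0)
#define IS_FLOAT(x)  _Generic((x), float: 1, default: 0)

static_assert(IS_DOUBLE(1.0), "an unsuffixed literal is a double");
static_assert(IS_FLOAT(1.0f), "the f suffix gives a float");

/* float * double literal: the expression has type double. */
int mixed_is_double(float x) { return IS_DOUBLE(x * 2.0); }

/* float * float literal: the expression stays float. */
int suffixed_stays_float(float x) { return IS_FLOAT(x * 2.0f); }
```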

by Sharlin 3 hours ago

Yeah, and even on CPU using doubles is almost unheard of in many fields.