If you are developing for ARM, some systems have hardware support for FP32 but use software emulation for FP64, with a noticeable performance difference.
That.... doesn't seem true? At least for most architectures I looked at?
While it's true that ADDPS and ADDPD have the same latency, using the zen4 example at least, the double variant only calculates 4 fp64 values compared to the single-precision variant's 8 fp32. Which was my point? If each double-precision instruction processes fewer inputs, it would need lower latency to keep the same operation rate.
And DIV also has significantly lower throughput for fp64 than for fp32 on zen4, 5clk/op vs 3, while also processing half the values?
Sure, if you're doing scalar fp32/fp64 instructions it's not much of a difference (though DIV still has a lower throughput) - but then you're already leaving significant peak flops on the table, so I'm not sure it's a particularly useful comparison. It's just the truism of "if you're not performance limited you don't need to think about performance" - which has always been the case.
So yes, they do at least have a 2:1 difference in throughput on zen4 - even higher for DIV.
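To put rough numbers on that, here's a back-of-the-envelope sketch in Python. It assumes the figures from this comment (8 fp32 vs 4 fp64 values per instruction, DIV at 3 clk/op for fp32 vs 5 for fp64) and that ADD issues at the same rate for both widths. The names and the ADD clock value are made up for illustration, and it only does the arithmetic, it doesn't measure anything.

```python
# Rough per-clock element rates from the numbers quoted in this comment.
# All constants are assumptions taken from the discussion, not measurements.

FP32_LANES = 8      # fp32 values per vector instruction
FP64_LANES = 4      # fp64 values per vector instruction

ADD_CLK = 1         # hypothetical clk/op, assumed identical for fp32 and fp64
DIV_FP32_CLK = 3    # clk/op for the fp32 divide
DIV_FP64_CLK = 5    # clk/op for the fp64 divide


def elements_per_clock(lanes, clocks_per_op):
    """Values produced per clock at a given reciprocal throughput."""
    return lanes / clocks_per_op


# ADD: same issue rate, half the lanes, so the placeholder clock cancels out.
add_ratio = (elements_per_clock(FP32_LANES, ADD_CLK)
             / elements_per_clock(FP64_LANES, ADD_CLK))

# DIV: half the lanes and more clocks per op.
div_ratio = (elements_per_clock(FP32_LANES, DIV_FP32_CLK)
             / elements_per_clock(FP64_LANES, DIV_FP64_CLK))

print(f"ADD fp32:fp64 element rate = {add_ratio:.2f}:1")
print(f"DIV fp32:fp64 element rate = {div_ratio:.2f}:1")
```

With those inputs ADD comes out at 2:1 and DIV at roughly 3.3:1, which is where the "even higher for DIV" comes from.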
That doesn't mean that there are no situations where it does matter today - which is what I feel is implied by calling it "Ancient".