I do think, though, that Nvidia generally didn't see much need for more FP64 in consumer GPUs, since they wrote in the Ampere (RTX 3090) white paper: "The small number of FP64 hardware units are included to ensure any programs with FP64 code operate correctly, including FP64 Tensor Core code."
I'll try adding an additional graph where I plot the APP values for all consumer GPUs up to 2023 (when the export control regime changed) to see whether the Adjusted Peak Performance argument for FP64 has merit.
Do you happen to know, though, whether GPUs count as vector processors under these regulations? The weighting factor changes depending on the classification.
https://www.federalregister.gov/documents/2018/10/24/2018-22... What I found so far is that under Note 7 it says: "A ‘vector processor’ is defined as a processor with built-in instructions that perform multiple calculations on floating-point vectors (one-dimensional arrays of 64-bit or larger numbers) simultaneously, having at least 2 vector functional units and at least 8 vector registers of at least 64 elements each."
Nvidia GPUs have only 32 threads per warp, short of the 64 elements per vector register that the definition requires, so I suppose they don't count as vector processors (which seems a bit weird, but who knows)?
Only two of these examples meet the definition of a vector processor, and both are very clearly classical vector supercomputers: the Cray X1E and the NEC SX-8. (As in, if you were preparing a guide on the historical development of vector processing, you would explicitly include these systems or their ancestors as canonical examples of what you mean by a vector supercomputer!) And the definition is pretty clearly tailored to make sure that the SIMD units in existing CPUs wouldn't qualify as vector processors.
The interesting case to point out is the last example, a "Hypothetical coprocessor-based Server," which describes something extremely similar to what GPGPU-based HPC systems actually became: "The host microprocessor is a quad-core (4 processors) chip, and the coprocessor is a specialized chip with 64 floating-point engines operating in parallel, attached to the host microprocessor through a specialized expansion bus (HyperTransport or CSI-like)." The document goes on to explain that this hypothetical system is not a "vector processor."
From what I can find, it seems that neither Nvidia nor the US government considers GPUs to be vector processors, so they get the 0.3 rather than the 0.9 weight.
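To make the weighting concrete, here is a back-of-the-envelope sketch of the APP arithmetic as I understand it: APP in Weighted TeraFLOPS is the sum over processors of W × R, where R is the peak 64-bit FP rate in TFLOPS and W is 0.9 for vector processors and 0.3 otherwise. The RTX 3090 figure below is my own rough assumption (FP32 peak divided by 64), so don't treat it as authoritative:

    // Back-of-the-envelope APP (Adjusted Peak Performance) in Weighted TeraFLOPS.
    // Assumes the EAR formula APP = sum_i W_i * R_i, with R_i the peak 64-bit
    // FP rate of processor i and W_i = 0.9 (vector) or 0.3 (non-vector).
    #include <cstdio>

    int main() {
        const double w_non_vector = 0.3;    // weight if GPUs do NOT count as vector processors
        const double w_vector     = 0.9;    // weight if they do
        const double rtx3090_fp64 = 0.556;  // assumed peak FP64 TFLOPS (~1/64 of FP32 peak)

        printf("APP at 0.3 weight: %.3f WT\n", w_non_vector * rtx3090_fp64);
        printf("APP at 0.9 weight: %.3f WT\n", w_vector * rtx3090_fp64);
        return 0;
    }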
I’d say it’s better than theory: you can definitely use float2 pairs of fp32 floats to emulate higher precision, and quad precision too, using float4. Here’s a paper with the code: https://andrewthall.com/papers/df64_qf128.pdf
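For anyone curious what the float2 trick looks like, here is a minimal CUDA sketch of the core df64 building blocks, Knuth's TwoSum and an FMA-based exact product, in the same spirit as Thall's paper. This is my own simplified version, not his code; it skips the more careful renormalization variants and assumes the compiler is not reassociating floats (i.e. no --use_fast_math):

    // Minimal double-float (df64): a value is hi + lo, with lo carrying the
    // rounding error of hi. Illustrative only; not the fully accurate variants.
    struct df64 { float hi, lo; };

    // Knuth's TwoSum: s + e == a + b exactly.
    __device__ df64 two_sum(float a, float b) {
        float s  = a + b;
        float bb = s - a;
        float e  = (a - (s - bb)) + (b - bb);
        return {s, e};
    }

    // Exact product via FMA: p + e == a * b exactly.
    __device__ df64 two_prod(float a, float b) {
        float p = a * b;
        float e = fmaf(a, b, -p);
        return {p, e};
    }

    // Simplified df64 addition with a final quick renormalization.
    __device__ df64 df64_add(df64 a, df64 b) {
        df64 s   = two_sum(a.hi, b.hi);
        float lo = s.lo + a.lo + b.lo;
        float hi = s.hi + lo;
        return {hi, lo - (hi - s.hi)};
    }

    // Simplified df64 multiplication.
    __device__ df64 df64_mul(df64 a, df64 b) {
        df64 p   = two_prod(a.hi, b.hi);
        float lo = p.lo + a.hi * b.lo + a.lo * b.hi;
        float hi = p.hi + lo;
        return {hi, lo - (hi - p.hi)};
    }

The key point is that two_sum and two_prod are exact: the (hi, lo) pair carries the full result, which is what lets two fp32 values stand in for something close to fp64 precision.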
Also note it’s easy to emulate fp64 using entirely integer instructions. (As a fun exercise, I attempted both doubles and quads in GLSL: https://www.shadertoy.com/view/flKSzG)
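In the same vein, here is a toy version of the integer-only approach (my own sketch, not the shadertoy code): it adds two positive, normal doubles purely with 64-bit integer operations on their bit patterns. It truncates instead of rounding to nearest and ignores zero/inf/NaN/subnormals and exponent overflow, so it is illustrative only:

    // Toy integer-only addition of two positive, normal IEEE doubles, given as
    // raw bit patterns (e.g. from __double_as_longlong). Truncates instead of
    // rounding; no zero/inf/NaN/subnormal or exponent-overflow handling.
    __device__ unsigned long long fp64_add_toy(unsigned long long a,
                                               unsigned long long b) {
        if (b > a) { unsigned long long t = a; a = b; b = t; }  // positive doubles order like ints
        unsigned long long ea = a >> 52, eb = b >> 52;          // biased exponents (sign bit is 0)
        unsigned long long ma = (a & 0xFFFFFFFFFFFFFULL) | (1ULL << 52);  // restore implicit 1
        unsigned long long mb = (b & 0xFFFFFFFFFFFFFULL) | (1ULL << 52);
        unsigned long long shift = ea - eb;
        if (shift > 52) return a;                   // b too small to affect a truncated sum
        unsigned long long m = ma + (mb >> shift);  // align and add mantissas
        if (m >> 53) { m >>= 1; ea += 1; }          // carry out: renormalize
        return (ea << 52) | (m & 0xFFFFFFFFFFFFFULL);
    }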
While it’s relatively easy to do, these approaches are a lot slower than fp64 hardware. My code is not optimized, not IEEE compliant, and not bug-free, but the emulated doubles are at least an order of magnitude slower than fp32, and the quads are two orders of magnitude slower. I don’t think Andrew Thall’s df64 can achieve a 1:4 float-to-double perf ratio either.
And I’m not sure, but I don’t think CUDA SMs are vector processors per se; not because of the fixed warp size, but more broadly because of the design and instruction set. I could be completely wrong, though, and Tensor Cores might well count as vector processors.
The catch is the exponent range: FP64's range is typically sufficient to avoid overflows and underflows in most applications, while a double-single built from two FP32 values inherits FP32's exponent range, which is insufficient for most scientific and technical computing.
For an adequate exponent range, you must use either three FP32 per FP64 or two FP32 plus an integer exponent. In either case the emulation becomes significantly slower than the simplistic double-single emulation.
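As a rough illustration of the "two FP32 and an integer" layout (my own sketch, not a production design): keep a double-single pair plus a separate integer exponent, and renormalize through frexpf/ldexpf after each operation. A simplified multiply, ignoring the finer rounding and renormalization details:

    // Toy extended-exponent double-single: x = (hi + lo) * 2^e, with hi, lo
    // floats and e a separate integer exponent. Multiplication sketch only;
    // assumes nonzero inputs and skips careful renormalization of lo.
    struct ds_ext { float hi, lo; int e; };

    __device__ ds_ext ds_ext_mul(ds_ext a, ds_ext b) {
        float p   = a.hi * b.hi;
        float err = fmaf(a.hi, b.hi, -p);    // exact residual of hi*hi via FMA
        err += a.hi * b.lo + a.lo * b.hi;    // approximate cross terms
        int ep;
        ds_ext r;
        r.hi = frexpf(p, &ep);               // pull hi back into [0.5, 1)
        r.lo = ldexpf(err, -ep);             // scale the low part to match
        r.e  = a.e + b.e + ep;               // exponent held in an ordinary int
        return r;
    }

Because the integer exponent absorbs the scale, overflow and underflow only happen when the int itself wraps, which is the whole point of the extra word; the price is the extra frexpf/ldexpf traffic on every operation.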
With the simpler double-single emulation, you cannot expect to just plug it into most engineering applications, e.g. SPICE for electronic circuit simulation, and have the application work. Some applications could be painstakingly modified to work with such an implementation, but that is not normally an option.
So to be interchangeable with standard FP64, you really must also emulate the exponent range, at the price of much slower emulation.
I did this at some point in the past, but today it makes no sense in comparison with the available alternatives.
Today, by far the best FP64 performance per dollar is achieved with a Ryzen 9950X or Ryzen 9900X in combination with Intel Battlemage B580 GPUs.
When money does not matter, you can use AMD Epyc in combination with AMD "datacenter" GPUs, which would achieve much better performance per watt, but the performance per dollar would be abysmally low.
FWIW, my own example (emulating doubles/quads with ints) gives the full exponent range with no wasted bits, since I’m just emulating the IEEE format directly.
Of course, there are also bignum libraries that can do arbitrary precision. I guess one of the things I meant to convey but didn’t say directly is that using double precision isn’t export controlled, as one might conclude from the top of this thread, but a certain level of fp64 performance might be.