Floating point is just an inefficient use of bits (its dynamic range is excessive for what weights actually need), especially during training, so compression will always be welcome there. Extreme quantization techniques (some of the <=4-bit methods, say) also tend to increase the entropy of the weights, which limits what lossless compression can do on top, so lossless and lossy compression (e.g., quantization) sometimes work against each other.
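A minimal sketch of that tension, not from the thread: the toy Gaussian "weights" and the crude uniform 4-bit quantizer below are assumptions for illustration. It compares how much zlib can losslessly squeeze out of raw float16 weights versus the same weights after 4-bit quantization; exact ratios will vary with the quantizer, but the quantized stream typically carries more entropy per stored bit and so compresses proportionally less.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for model weights: bell-shaped, like most trained layers.
w = rng.normal(0, 1, 1_000_000).astype(np.float16)

# Lossless compression of raw float16: the sign/exponent bits are highly
# redundant for bell-shaped data, so zlib finds something to remove.
raw = w.tobytes()
print("fp16 :", len(raw), "->", len(zlib.compress(raw, 9)))

# Crude 4-bit quantization (assumed here, not any specific method):
# clip to roughly +/-4 sigma, map to 16 uniform levels, pack 2 per byte.
q = np.clip(np.round((w / 4.0 + 0.5) * 15), 0, 15).astype(np.uint8)
packed = (q[0::2] << 4) | q[1::2]
b = packed.tobytes()
print("4-bit:", len(b), "->", len(zlib.compress(b, 9)))
```

Comparing each output's compressed/original ratio shows the quantized representation is already closer to incompressible, which is the sense in which lossy quantization eats into lossless compression's headroom.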
If you have billions of dollars in inference devices, even reducing the number of devices you need for a given workload by 5% is very useful.
MI300X is 192GB HBM3, MI325X is 256GB HBM3e, MI355X should be 288GB HBM3e (and support FP4/FP6).
As long as AMD fixes the damn driver issues I've seen for over a decade.
Nvidia is about to release Blackwell Ultra with 288GB. Go back to maybe 2018 and the max was 16GB, if memory serves.
DeepSeek recently released a ~670GB model. A couple of years ago, Falcon's 180GB seemed huge.
We've been stuck with the same general caps on standard GPU memory since then, though, perhaps in part because generational upgrades have gone into memory bandwidth rather than capacity.
A one-time, effective ~30% reduction in model size simply isn't going to be some massive unlocker, in theory or in practice.
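A back-of-the-envelope sketch of why; the model and device sizes are illustrative assumptions borrowed from the numbers upthread, not vendor specs:

```python
import math

model_gb = 670   # e.g., a DeepSeek-V3-class checkpoint
device_gb = 192  # e.g., an MI300X-class accelerator

for reduction in (0.0, 0.30):
    size = model_gb * (1 - reduction)
    devices = math.ceil(size / device_gb)
    print(f"{reduction:.0%} smaller -> {size:.0f}GB -> {devices} devices")
# 0% -> 670GB -> 4 devices; 30% -> 469GB -> 3 devices. Helpful, but a
# one-time constant-factor saving, not a step change -- and in practice
# activations and KV cache eat into device memory on top of the weights.
```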