(Previously this comment said Rubin did native NVFP4, but Blackwell does too! Rubin just also trains with native NVFP4, which Blackwell does not.)
Nvidia says Rubin should have fewer stability problems training with FP4 because of hardware changes - "adaptive compression". There will still be outlier instability inherently, but something they're designing in reduces the cost of managing it.
But yeah, grain of salt - we haven't seen this in practice.
Of course there are techniques such as quantization aware training but I don't understand why a datatype would work for inference but not for that.
You can also abandon backprop entirely but that comes with a whole host of tradeoffs and again why would it work for inference but not for whatever alternative training regime you selected?