by zackangelo 17 hours ago

by Schiendelman 16 hours ago

You're right - Rubin is better at NVFP4 training, not inference. Thank you for catching me!

by boroboro4 16 hours ago

What does it mean that it's better at NVFP4 training? What's different between training and inference that makes this true?

by Schiendelman 15 hours ago

We're getting to the limit of my understanding, but I believe most Blackwell users still run FP8 passes through the Transformer Engine - they just store weights in NVFP4. Nvidia has model-specific stabilization recipes for end-to-end NVFP4, but those are still getting fixes all the time.

Nvidia says Rubin should have fewer stability problems training with FP4 because of hardware changes - "adaptive compression". There will still be outlier instability inherently, but something they're designing in reduces the cost of managing it.

But yeah, grain of salt - we haven't seen this in practice.
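
To make the outlier point concrete, here is a rough sketch of what a single outlier does to a block that shares one scale. It assumes the E2M1 magnitude grid and a 16-element block, and it keeps the scale as a plain float (real NVFP4 also quantizes the scale), so treat it as an illustration and not as Nvidia's actual recipe. The names `quantize_block` and `mean_abs_error` are made up for this example.

```python
# Illustrative sketch only: simplified block-scaled 4-bit quantization.
# Assumptions: E2M1 magnitude grid, 16-element blocks, unquantized float scale.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16


def quantize_block(values):
    """Quantize one block to the grid with a shared scale, then dequantize."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    scale = amax / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    out = []
    for v in values:
        # Pick the representable magnitude closest to the scaled value.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def mean_abs_error(values):
    """Average absolute round-trip error for one block."""
    q = quantize_block(values)
    return sum(abs(a - b) for a, b in zip(values, q)) / len(values)


# A well-behaved block: values spread evenly over [-1, 1].
smooth = [-1.0 + 2.0 * i / (BLOCK - 1) for i in range(BLOCK)]
# The same block with its last element replaced by one large outlier.
spiky = smooth[:-1] + [50.0]

err_smooth = mean_abs_error(smooth)
err_spiky = mean_abs_error(spiky)
print(f"smooth block error:  {err_smooth:.4f}")
print(f"outlier block error: {err_spiky:.4f}")
```

In this toy setup the outlier stretches the shared scale so far that every other value in the block rounds to zero, which is the kind of damage the stabilization recipes are trying to contain.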

by fc417fc802 15 hours ago

I'm also puzzled by that statement. The issue with training is (as I understand it) one of precision and the associated numerical stability. You need enough bits in order for backprop to function correctly.

Of course there are techniques such as quantization-aware training, but I don't understand why a datatype would work for inference but not for that.

You can also abandon backprop entirely, but that comes with a whole host of tradeoffs. And again, why would it work for inference but not for whatever alternative training regime you selected?
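
For what it's worth, one commonly cited reason training needs more bits than inference is that weight updates are tiny compared with the spacing of a coarse format, so they round away unless something accumulates them at higher precision. A minimal sketch, assuming a uniform grid with step 0.5 as a stand-in for a low-precision format (real FP4 spacing is non-uniform) and a hypothetical fixed update size:

```python
# Illustrative sketch only: a uniform grid stands in for a low-precision format.
STEP = 0.5


def to_grid(x, step=STEP):
    """Round x to the nearest multiple of step."""
    return round(x / step) * step


lr_times_grad = 0.01  # hypothetical size of each weight update
steps = 100

# Case 1: the weight itself is stored on the coarse grid after every update.
w_low = 1.0
for _ in range(steps):
    w_low = to_grid(w_low - lr_times_grad)

# Case 2: a higher-precision master copy accumulates the updates and is only
# rounded to the grid when it is needed for a forward pass.
w_master = 1.0
for _ in range(steps):
    w_master -= lr_times_grad
w_forward = to_grid(w_master)

print("stored on the grid every step:", w_low)
print("master copy after training:   ", w_master)
print("master copy rounded for use:  ", w_forward)
```

Here the grid-stored weight never moves, because each update is smaller than half a grid step, while the master copy drifts all the way to roughly zero. Inference only ever needs the rounded value, which is one way a datatype can be fine for inference and still awkward for training.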

by Schiendelman 26 minutes ago

See my reply to the GP comment!