It's unclear what makes Rubin specifically optimized for inference. I can see the disaggregated part (separate prefill and decode nodes), but what else?

by villgax 15 hours ago

Lot more SMs & Tensor Cores for NVFP4 going by the looks of it.

by nullc 18 hours ago

How do you get 5x faster at inference when inference is memory-bandwidth limited? Getting 5x the memory bandwidth of an H100 seems physically difficult.

by Schiendelman 17 hours ago

Rubin has 22TB/s of memory bandwidth vs Blackwell's 8TB/s. NVLink 6 doubles interconnect speed. Plus they're moving to 3nm from ~4nm.

(Previously this comment said Rubin did native NVFP4, but Blackwell does too! Rubin just also trains with native NVFP4, which Blackwell does not.)
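
To put numbers on the bandwidth point, here is a rough sketch of the single-stream decode ceiling, assuming every generated token has to stream all the weights from memory once. The 22 TB/s and 8 TB/s figures are the ones quoted above; the 70B parameter count and 4-bit weights are hypothetical, and the function name is made up for illustration.

```python
def decode_tokens_per_second(bandwidth_tb_s, params_billions, bytes_per_param):
    """Upper bound on single-stream decode rate if each token reads all weights once."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes


PARAMS_B = 70          # hypothetical model size, in billions of parameters
BYTES_PER_PARAM = 0.5  # assumes 4-bit weights, ignoring scale-factor overhead

blackwell = decode_tokens_per_second(8, PARAMS_B, BYTES_PER_PARAM)
rubin = decode_tokens_per_second(22, PARAMS_B, BYTES_PER_PARAM)
speedup = rubin / blackwell  # the model size cancels out; only the bandwidth ratio is left

print(f"Blackwell ceiling: {blackwell:.0f} tok/s")
print(f"Rubin ceiling:     {rubin:.0f} tok/s")
print(f"Ratio:             {speedup:.2f}x")
```

The ratio is just 22 / 8 = 2.75, so under this simple model bandwidth alone would not get to 5x; the rest would have to come from somewhere else (compute, interconnect, batching).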

by boredatoms 17 hours ago

Moving to lower bits is not a slam dunk; the model itself might degrade too much.

by Schiendelman 16 hours ago

Of course, but for most workflows it's fine.

by zackangelo 17 hours ago

Blackwell supports NVFP4 natively.

by Schiendelman 16 hours ago

You're right - Rubin is better at NVFP4 training, not inference, thank you for catching me!

by boroboro4 16 hours ago

What does it mean that it's better at NVFP4 training? What's different between training and inference to make this true?

by Schiendelman 15 hours ago

We're getting to the limit of my understanding, but I believe most Blackwell users still usually run FP8 passes through the transformer engine - they'll just store weights at NVFP4. Nvidia has model-specific stabilization recipes for NVFP4 end to end, but they're taking fixes all the time.

Nvidia says Rubin should have fewer stability problems training with FP4 because of hardware changes - "adaptive compression". There will still be outlier instability inherently, but something they're designing in reduces the cost of managing it.

But yeah, grain of salt - we haven't seen this in practice.
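
For a feel of why outliers are the hard part, here is a simplified sketch of block-scaled 4-bit quantization. It assumes an E2M1 value grid with one shared scale per 16-element block, which is how NVFP4 has been publicly described as far as I know; the real format also stores the block scale in FP8 and adds a per-tensor scale, which this sketch skips by keeping the scale in full precision. `quantize_block` is a made-up helper, not an Nvidia API.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign is handled separately).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(values):
    """Quantize one block to the E2M1 grid with a shared scale, then dequantize."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return [0.0] * len(values)
    scale = amax / 6.0  # map the largest magnitude in the block onto the top grid point
    out = []
    for v in values:
        # Nearest grid point to the scaled magnitude.
        mag = min(E2M1_MAGNITUDES, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


calm_block = [0.1] * 15 + [0.15]     # 16 values of similar size
outlier_block = [0.1] * 15 + [12.0]  # same block with one large outlier

calm_result = quantize_block(calm_block)
outlier_result = quantize_block(outlier_block)

print("calm block, first value:   ", calm_result[0])
print("outlier block, first value:", outlier_result[0])
```

In the second block the single outlier sets the scale, so the small values round to zero. In inference you can pick scales offline from calibration data; in training the gradients change every step, which is presumably where the stabilization recipes come in.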

by fc417fc802 15 hours ago

I'm also puzzled by that statement. The issue with training is (as I understand it) one of precision and the associated numerical stability. You need enough bits in order for backprop to function correctly.

Of course there are techniques such as quantization aware training but I don't understand why a datatype would work for inference but not for that.

You can also abandon backprop entirely but that comes with a whole host of tradeoffs and again why would it work for inference but not for whatever alternative training regime you selected?

by Schiendelman 30 minutes ago

See my reply to the GP comment!

by unrvl2213 hours ago

Inference is only memory-bandwidth limited when you're targeting high single-stream tokens per second. The weights only need to be moved across once per forward pass, so when you batch, say, 100 streams per forward pass (which is what most inference services do and care about), it's compute-bottlenecked.
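
A back-of-the-envelope roofline sketch of that argument: the weights are read once per forward pass regardless of batch size, while the math grows with the batch. All hardware numbers below are hypothetical placeholders, the 2 FLOPs per parameter per token is the usual rough estimate, and KV-cache and activation traffic are ignored.

```python
def forward_pass_times(batch, params, bytes_per_param, bandwidth, peak_flops):
    """Return (memory_seconds, compute_seconds) for one decode step over a batch."""
    memory_time = params * bytes_per_param / bandwidth  # weights streamed once per pass
    compute_time = 2 * params * batch / peak_flops      # ~2 FLOPs per parameter per token
    return memory_time, compute_time


def bottleneck(batch, params, bytes_per_param, bandwidth, peak_flops):
    memory_time, compute_time = forward_pass_times(
        batch, params, bytes_per_param, bandwidth, peak_flops
    )
    return "compute" if compute_time > memory_time else "memory"


def crossover_batch(bytes_per_param, bandwidth, peak_flops):
    """Batch size at which compute time equals weight-streaming time."""
    return peak_flops * bytes_per_param / (2 * bandwidth)


PARAMS = 70e9          # hypothetical model size
BYTES_PER_PARAM = 0.5  # assumes 4-bit weights
BANDWIDTH = 8e12       # bytes/s, the Blackwell figure quoted upthread
PEAK_FLOPS = 1e16      # hypothetical peak throughput, not a spec-sheet number

for batch in (1, 1000):
    print(batch, bottleneck(batch, PARAMS, BYTES_PER_PARAM, BANDWIDTH, PEAK_FLOPS))
print("crossover batch:", crossover_batch(BYTES_PER_PARAM, BANDWIDTH, PEAK_FLOPS))
```

The exact crossover depends on the chip's FLOPs-to-bandwidth ratio and on the bytes per weight, so treat the number as illustrative. The shape of the argument holds either way: small batches are memory-bound, large batches are compute-bound.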