
Kimi uses INT4 as its native format, so there's no such thing as "better than 4-bit precision" for that model. This contrasts with GLM, for which 16-bit precision is native and 8-bit is in common use.

by hassaanr9 hours ago|


You’re right, but this poses a separate issue: the providers then do FP4 PTQ (post-training quantization), which is quite lossy. It reduces the model size and optimizes for Blackwells at the (imo severe) cost of performance.

by unrvl2213 hours ago|


MI355X can perform FP6 operations at the same speed as FP4 (unique to AMD). People should be making MXFP6 quants, which would be pretty much lossless and much closer to FP4 in speed than FP8 is.

by Hugsun5 hours ago|


That can only be true if the workload is compute bound, not memory bandwidth bound.
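To put rough numbers on the memory-bound case, here is a minimal sketch that compares how many bits per weight each MX format has to stream from memory. It assumes the OCP Microscaling layout of 32 elements sharing one 8-bit scale, and it assumes decode time is dominated by reading weights; the constant and function names are made up for illustration.

```python
# Assumed MX layout: 32 elements per block, one shared 8-bit scale per block.
MX_BLOCK_SIZE = 32
MX_SCALE_BITS = 8


def bits_per_weight(element_bits):
    """Effective storage per weight, including the amortized block scale."""
    return element_bits + MX_SCALE_BITS / MX_BLOCK_SIZE


formats = {"MXFP4": 4, "MXFP6": 6, "MXFP8": 8}
baseline = bits_per_weight(formats["MXFP4"])

# If the workload is memory-bandwidth bound, time per token scales with
# the bytes read, so this ratio approximates the slowdown relative to MXFP4.
relative_traffic = {
    name: bits_per_weight(bits) / baseline for name, bits in formats.items()
}

for name, ratio in relative_traffic.items():
    print(f"{name}: {bits_per_weight(formats[name]):.2f} bits/weight, "
          f"{ratio:.2f}x the MXFP4 memory traffic")
```

Under these assumptions MXFP6 moves roughly one and a half times the data of MXFP4, which puts it about midway between MXFP4 and MXFP8 rather than close to FP4. Equal FP6 and FP4 math throughput only helps once the kernel is compute bound.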

by minraws4 hours ago|


Doesn't Nvidia claim that NVFP4 is lossless?

I haven't tested many of the models Nvidia has converted to NVFP4 besides GLM 5.2, but that one seemed fine to me.

My own luck has been hit or miss with it.

by HDThoreaun25 minutes ago|


Certainly not lossless. Whether the loss matters depends on the range of values being quantized. When there are outliers that are massively higher than their neighbors, the precision of those neighbors gets wrecked (or the outlier gets clipped), so it's important to use strategies that decrease the maximum value or increase the minimum. I suspect some models put more effort into that and are therefore more effective when quantized.
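A minimal sketch of the outlier effect, assuming plain symmetric absmax quantization onto a uniform signed 4-bit integer grid. This is not NVFP4 or any provider's actual scheme (FP4 grids are non-uniform and use finer-grained scales); the function names and sample values are made up for illustration.

```python
def quantize_absmax_int4(block):
    """Quantize and dequantize a block on a symmetric 4-bit grid [-7, 7]."""
    scale = max(abs(x) for x in block) / 7.0
    if scale == 0.0:
        return list(block)
    # Round each value to the nearest grid step, then map back to floats.
    return [max(-7, min(7, round(x / scale))) * scale for x in block]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


smooth = [0.10, -0.25, 0.40, -0.55, 0.70, -0.85, 0.30, -0.15]
spiky = smooth[:-1] + [12.0]  # same neighbors, last value replaced by an outlier

# Compare the error on the seven shared neighbor values only.
err_smooth = mean_abs_error(smooth[:-1], quantize_absmax_int4(smooth)[:-1])
err_spiky = mean_abs_error(spiky[:-1], quantize_absmax_int4(spiky)[:-1])

print(f"neighbor error without outlier: {err_smooth:.4f}")
print(f"neighbor error with outlier:    {err_spiky:.4f}")
```

With the outlier present the scale is set by 12.0, so every neighbor rounds to zero and its information is lost entirely. Smaller blocks or clipping the outlier are the usual ways to trade that off.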

by google23412315 hours ago|


First thing I noticed as well

by tw198414 hours ago|


From memory, it is like 96-98% of the accuracy.

by lgessler14 hours ago|


Accuracy isn't a meaningful metric here without reference to a specific task.

by flawn10 hours ago|


Additionally, I'd imagine quantization has more side effects than just slightly lower performance (on whatever task). You are basically removing information, and that information could by chance be what the model needs to complete a task exactly the way you'd want, even though the model is still fully capable. I'm not sure whether this is really different from "lower performance", but I'm open to hearing your opinions.

by EduardoBautista13 hours ago|


And that 2%-4% makes all the difference.

by fpaf11 hours ago|


Yes, it's like saying "we took off a big chunk of his brain but look! He can still breathe autonomously, swallow food and walk almost straight, which is like 95% of what he did before!"