Kimi uses INT4 as its native format, so there's no such thing as "better than 4-bit precision" for that model. This is in contrast with GLM, for which 16-bit precision is native and 8-bit is in common use.
You're right, but this poses a separate issue: the providers then do FP4 PTQ (post-training quantization), which is quite lossy. It reduces the model size and optimizes for Blackwell GPUs at the (imo severe) cost of model quality.
MI355X can perform FP6 operations at the same speed as FP4 (unique to AMD). People should be making MXFP6 quants, which would be pretty much lossless and much closer to FP4 than to FP8 in speed.
Certainly not lossless. Whether the loss matters depends on the range of values being quantized. When there are outliers that are massively larger than their neighbors, the precision of those neighbors gets wrecked (or the outlier gets clipped), so it's important to use strategies that decrease the maximum value or increase the minimum. I suspect some models put more effort into that and are therefore more effective when quantized.
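To make the outlier point concrete, here is a minimal sketch in plain Python. It assumes simple symmetric round-to-nearest integer quantization with one shared scale per group, which is not the exact FP4 or MXFP6 scheme discussed above. The function names and the sample values are made up for illustration.

```python
def quantize_absmax(values, bits=4, clip=None):
    """Quantize and dequantize a group that shares a single scale.

    Hypothetical helper: signed integer levels, round-to-nearest,
    scale chosen from the largest magnitude (optionally clipped).
    """
    qmax = 2 ** (bits - 1) - 1               # 7 for signed 4-bit
    max_abs = max(abs(v) for v in values)
    if clip is not None:
        max_abs = min(max_abs, clip)         # trade outlier accuracy for a finer grid
    scale = max_abs / qmax
    restored = []
    for v in values:
        q = round(v / scale)
        q = max(-qmax, min(qmax, q))         # saturate values beyond the clip range
        restored.append(q * scale)
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Made-up weights: eight small values, then the same group plus one outlier.
smooth = [0.11, -0.23, 0.31, -0.42, 0.52, -0.13, 0.27, -0.35]
with_outlier = smooth + [12.0]

# No outlier: the scale fits the small values and the error stays small.
err_smooth = mean_abs_error(smooth, quantize_absmax(smooth))

# With the outlier: the scale is set by 12.0, so every neighbor rounds to zero.
restored = quantize_absmax(with_outlier)
err_neighbors = mean_abs_error(smooth, restored[:-1])

# Clipping at 1.0 rescues the neighbors but destroys the outlier instead.
clipped = quantize_absmax(with_outlier, clip=1.0)
err_neighbors_clipped = mean_abs_error(smooth, clipped[:-1])
err_outlier_clipped = abs(with_outlier[-1] - clipped[-1])

print(f"neighbors, no outlier:       {err_smooth:.4f}")
print(f"neighbors, outlier in group: {err_neighbors:.4f}")
print(f"neighbors, outlier clipped:  {err_neighbors_clipped:.4f}")
print(f"outlier error when clipped:  {err_outlier_clipped:.4f}")
```

With these numbers a single outlier pushes all eight neighbors to zero, and clipping only moves the damage onto the outlier itself. Smaller groups per scale limit how many neighbors one outlier can affect, which is one reason block-scaled formats exist.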
Additionally, I'd imagine quantization has more side effects than just slightly lower performance (on whatever task). You are basically removing information, and that information could, by chance, be exactly what the model needs to complete a task the way you'd want, even though the model is otherwise still fully capable. I am not sure if this is really different from "lower performance", but I'm open to hearing your opinions.
Yes, it's like saying "we took off a big chunk of his brain but look! He can still breathe autonomously, swallow food and walk almost straight, which is like 95% of what he did before!"