you misunderstand what that chart shows - it shows BF16 QAT Q4_0, not BF16 regular.

meaning Google quantized the model to 4 bit and stored the result in BF16 format for compatibility and convenience to downstream packers.

Like storing small 8 bit numbers in full 32 bit integers.

So it's not close to 100% of unquantized BF16.

I'm curious if anybody can explain why Google's released 4-bit QAT Q4_0 is not exactly 100% of BF16 QAT Q4_0. It seems like it should be just bit twiddling, with no further quantization needed to convert between these two packings. Unsloth talks about "lattice alignment" being an issue.

That being said, I hate that smol model makers like Google, Qwen, ... only show the BF16 benchmarks when they release new models, knowing that what people really run are 4-8 bit quantizations. It makes it really hard to understand how much you lose when you run 4 bit vs 6 bit...

by coder543 23 hours ago

> meaning Google quantized the model to 4 bit and stored the result in BF16 format for compatibility and convenience to downstream packers.

You also misunderstand what is happening. Google did not do that. Google further trained the original model with an objective of minimizing error when quantized to 4-bit. The BF16 QAT is not an upscaled 4-bit model. When quantized to 4-bit, it should lose less accuracy than a typical 16-bit model loses when quantized to 4-bit, but the loss is not zero, because it is not based on a 4-bit model.
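To make the distinction concrete, here is a rough sketch of a Q4_0-style block quantizer (one scale per block of 32 values, 4-bit codes). It is a simplified illustration in plain Python floats, not llama.cpp's actual code: the block size, the rounding rule, and all function names are assumptions for this example, and it ignores the fact that real scales and weights are stored in 16-bit formats. It shows that values already sitting on a 4-bit grid survive another quantization pass unchanged, while ordinary trained weights do not.

```python
import random

BLOCK = 32  # block size assumed for this sketch


def quantize_block(xs):
    """Map one block of floats to (scale, codes), with codes in 0..15."""
    peak = max(xs, key=abs)        # largest-magnitude value, sign kept
    scale = peak / -8.0            # chosen so that peak maps to code 0
    if scale == 0.0:
        return 0.0, [8] * len(xs)  # all-zero block
    codes = [min(15, int(x / scale + 8.5)) for x in xs]
    return scale, codes


def dequantize_block(scale, codes):
    """Turn 4-bit codes back into floats on the block's grid."""
    return [(c - 8) * scale for c in codes]


def roundtrip(xs):
    return dequantize_block(*quantize_block(xs))


def max_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


random.seed(0)
# Stand-in for trained 16-bit weights: not aligned to any 4-bit grid.
trained = [random.gauss(0.0, 0.02) for _ in range(BLOCK)]
# What an "upscaled 4-bit model" would look like: values already on the grid.
on_grid = roundtrip(trained)

off_grid_loss = max_error(trained, roundtrip(trained))
on_grid_loss = max_error(on_grid, roundtrip(on_grid))
print(off_grid_loss > 0.0, on_grid_loss == 0.0)
```

In this toy setup only the second case is lossless. QAT aims to shrink the first kind of error, but the weights it produces are still not guaranteed to sit exactly on the grid.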

The Gemma 3 QAT report was a bit clearer:

https://developers.googleblog.com/en/gemma-3-quantized-aware...

"Instead of just quantizing the model after it's fully trained, QAT incorporates the quantization process during training. QAT simulates low-precision operations during training to allow quantization with less degradation afterwards for smaller, faster models while maintaining accuracy. Diving deeper, we applied QAT on ~5,000 steps using probabilities from the non-quantized checkpoint as targets. We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0."

The BF16 is just trained to be more resistant to simulated quantization, which helps when it is actually quantized. Google is not doing post-training on the 4-bit model directly.
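For anyone unfamiliar with what "simulated quantization" means in training, below is a minimal sketch of the common fake-quantization trick with a straight-through estimator. This is a generic illustration and an assumption on my part, not Google's recipe: the blog post quoted above does not give implementation details, and the toy linear model, the fixed `scale`, and the function names are all made up for the example.

```python
def fake_quant(w, scale):
    """Snap w to a signed 4-bit grid (codes -8..7) but return a float."""
    code = max(-8, min(7, round(w / scale)))
    return code * scale


def qat_step(weights, xs, target, scale, lr):
    """One SGD step on squared error for y = sum(q(w_i) * x_i)."""
    # The forward pass sees the quantized weights.
    wq = [fake_quant(w, scale) for w in weights]
    y = sum(q * x for q, x in zip(wq, xs))
    err = y - target
    # Straight-through estimator: treat d q(w) / d w as 1, so the gradient
    # with respect to the quantized weight updates the full-precision weight.
    return [w - lr * 2.0 * err * x for w, x in zip(weights, xs)]


weights = [0.3, -0.6]  # full-precision "latent" weights
updated = qat_step(weights, xs=[1.0, 2.0], target=0.0, scale=0.25, lr=0.1)
print(updated)
```

The weights being trained stay in full precision the whole time. Only the forward pass sees the rounded values, which is why the resulting checkpoint is a 16-bit model that tolerates rounding well rather than a 4-bit model in a 16-bit container.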

by 3abiton 19 hours ago

Is there evidence that this approach helps maintain "accuracy" when quantized? It sounds a bit like mxfp4 with gpt-oss, which was a confusing model upon release.

by dofm 2 hours ago

I have just been humbled by the Gemma 4 26B QAT build (Unsloth's version), which repeatedly insisted that my requirements for some niche WordPress code were wrong and could not be satisfied.

I am a good WP developer so I kept prodding it and it kept insisting, and it explained with clarity. Turns out it is right and I was wrong, as I would have found out if I'd written the code myself.

I've been using this particular test for days, experimenting with ways to generate and prompt code. The 4-bit quantisation of the pre-QAT model does not catch this error. Nor does the Qwen 3.6 sparse model, which confidently blazed past it and never mentioned it.

(FWIW neither did plain ChatGPT; maybe Codex would)

Anecdotal, but there you go. I am somewhat weirded out by it.

by ComputerGuru 20 hours ago

So what we want now is unsloth (or anyone) to release 4/6-bit quantized models of these releases?

by coder543 20 hours ago

Yep, Unsloth already did, as linked in the comment at the top of this thread.

by satvikpendem 1 day ago