undefined

points

[-]

"I've no idea why one would use gpt-oss-20b at Q8" - would you mind expanding on this comment?

In that particular model family, the choices are 20B and 120B, so 20B higher quant fits in VRAM, while you'd be settling for 120B at a lower quant. Is it that 20B MXFP4 is comparable in performance so no need for Q8?

Or is the insight simply that there are better models available now and the emphasis is on gpt-oss-20b, not Q8?

by Mindless21122 days ago|

parent|

[-]

The parameters in the original gpt-oss-20B model are "post-trained with MXFP4 quantization", so there just isn't much to gain by quantizing to Q8. If you look inside the Q8 model, most of the parameters are MXFP4 anyway.

Though, looking inside my "gpt-oss 20B MXFP4 MoE" model, it looks to also be quantized the same way as the Q8, so that was probably an overstatement on my part.

Still, the Q8 is 12.1 GB and the FP16 is 13.8 GB. Not the ~1:2 ratio you might expect.

by ycui72 days ago|

prev|

[-]

At this speed, people end up paying more on electricity than api calls. (California electricity)