> without completely lobotomizing it

The question with quants is whether they lobotomize it past the point where you'd be better off switching to a smaller model like GPT-OSS 120B, which comes prequantized at ~60GB.

reply
In general, quantizing down to 6 bits gives no measurable loss in performance, and down to 4 bits gives a small measurable loss. Performance starts dropping faster at 3 bits, and at 1 bit it can fall below that of the next smaller model in the family (families tend to step model sizes by roughly a factor of 4 in parameter count).

So in the same family, you can generally quantize all the way down to 2 bits before you want to drop down to the next smaller model size.

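For intuition on why quality drops nonlinearly with bit width, here's a toy round-to-nearest sketch (illustration only; real quantizers such as GPTQ, AWQ, or llama.cpp's K-quants use group-wise scales and error compensation, so they fare better at low bit widths):

    # Toy symmetric round-to-nearest quantization, per row of a weight matrix.
    import numpy as np

    def quantize_dequantize(weights: np.ndarray, bits: int) -> np.ndarray:
        qmax = 2 ** (bits - 1) - 1                   # e.g. 7 for 4-bit signed
        scale = np.abs(weights).max(axis=1, keepdims=True) / qmax
        q = np.clip(np.round(weights / scale), -qmax - 1, qmax)
        return q * scale                             # lossy reconstruction used at inference

    rng = np.random.default_rng(0)
    w = rng.normal(size=(1024, 1024)).astype(np.float32)
    for bits in (8, 6, 4, 3, 2):
        err = np.abs(quantize_dequantize(w, bits) - w).mean()
        print(f"{bits}-bit mean abs reconstruction error: {err:.4f}")

The reconstruction error roughly doubles for each bit you drop, which lines up with the pattern above: negligible at 6 bits, small at 4, steep below 3.
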
Between families, there will obviously be more variation. You really need evals specific to your use case if you want to compare them: model families can perform quite differently on different types of problems, and because everyone optimizes for the public benchmarks, it really helps to have your own to test against.

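A minimal sketch of what such a use-case-specific eval can look like, assuming an OpenAI-compatible endpoint (llama.cpp server, vLLM, etc.); the cases and model names below are placeholders to swap for your own tasks and quants:

    # Score each candidate model/quant on your own pass/fail cases.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

    # Each case: a prompt plus a checker deciding whether the answer is acceptable.
    CASES = [
        ("What is 17 * 23?", lambda out: "391" in out),
        ('Reply with exactly this JSON: {"ok": true}', lambda out: '"ok"' in out),
    ]

    def score(model: str) -> float:
        passed = 0
        for prompt, check in CASES:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
            )
            if check(resp.choices[0].message.content or ""):
                passed += 1
        return passed / len(CASES)

    for model in ("my-model-q6", "my-model-q4", "gpt-oss-120b"):  # placeholder names
        print(model, score(model))

Even a few dozen cases drawn from your real workload will tell you more about a q6 vs q4 vs smaller-model trade-off than a public leaderboard will.
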
reply
> In general, quantizing down to 6 bits gives no measurable loss in performance.

...this can't be literally true or no one (including e.g. OpenAI) would use > 6 bits, right?

reply
Did you run, say, SWE-bench Verified? Where is this claim coming from? It's just an urban legend.

reply