by lostmsu 11 hours ago

by alex43578 11 hours ago

Quants will push it below 256GB without completely lobotomizing it.

by lostmsu 8 hours ago

> without completely lobotomizing it

The question in the case of quants is: will they lobotomize it beyond the point where it would be better to switch to a smaller model like GPT-OSS 120B, which comes prequantized to ~60GB?
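For a rough sense of the sizes involved, here is a back-of-the-envelope sketch. It assumes weight storage is simply parameters times bits per weight, ignoring the KV cache, activations, and any per-block quantization overhead; the function name is made up for illustration.

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Assumption: size = parameters * bits / 8. Real quant formats add some
    overhead for scales and keep a few tensors at higher precision.
    """
    return n_params * bits_per_weight / 8 / 1e9


# 120B parameters at about 4 bits per weight matches the ~60GB figure above.
print(quantized_size_gb(120e9, 4))   # 60.0
# The same parameter count at 16 bits would need about four times as much.
print(quantized_size_gb(120e9, 16))  # 240.0
```

Plugging in a larger model's parameter count and the bit width of a given quant shows quickly whether it can fit under a 256GB budget at all.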

by lambda 5 hours ago

In general, quantizing down to 6 bits gives no measurable loss in performance. Down to 4 bits gives a small measurable loss. It starts dropping faster at 3 bits, and at 1 bit it can fall below the performance of the next smaller model in the family (where families tend to have model sizes at factors of 4 in number of parameters).

So in the same family, you can generally quantize all the way down to 2 bits before you want to drop down to the next smaller model size.
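As a sketch of that rule of thumb (not a validated result): given a memory budget, work out how many bits per weight each model in a family could get, and take the largest one that still gets at least 2 bits. The family sizes, the 16-bit cap, and the function names are assumptions for illustration.

```python
def bits_available(n_params: float, budget_gb: float) -> float:
    """Bits per weight that fit in budget_gb (1 GB = 1e9 bytes), weights only."""
    return budget_gb * 1e9 * 8 / n_params


def pick_model(family_params, budget_gb, min_bits=2.0):
    """Return (n_params, bits) for the largest model that still gets at least
    min_bits per weight, or None if even the smallest model does not.

    min_bits=2.0 encodes the rule of thumb above; it is a heuristic.
    """
    for n_params in sorted(family_params, reverse=True):
        bits = bits_available(n_params, budget_gb)
        if bits >= min_bits:
            # No point going above the usual 16-bit native precision.
            return n_params, min(bits, 16.0)
    return None


# Hypothetical family with sizes a factor of 4 apart.
family = [8e9, 32e9, 128e9]
print(pick_model(family, budget_gb=32))  # (128000000000.0, 2.0)
print(pick_model(family, budget_gb=24))  # (32000000000.0, 6.0)
```

With a 32GB budget the largest model just clears 2 bits, while at 24GB it would need 1.5 bits, so the sketch drops to the next size down at 6 bits.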

Between families, there will obviously be more variation. You really need to have evals specific to your use case if you want to compare them, as there can be quite different performance on different types of problems between model families, and because of optimizing for benchmarks it's really helpful to have your own to really test it out.
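A minimal sketch of what such a private eval could look like, assuming each model is wrapped as a plain function from prompt to answer and that exact string match is a good enough score. Both are simplifying assumptions, and `compare_models` is a made-up name, not part of any library.

```python
from typing import Callable, Dict, List, Tuple


def compare_models(
    models: Dict[str, Callable[[str], str]],
    cases: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Return the fraction of cases each model answers exactly right."""
    if not cases:
        raise ValueError("need at least one eval case")
    scores = {}
    for name, generate in models.items():
        correct = sum(
            generate(prompt).strip() == expected for prompt, expected in cases
        )
        scores[name] = correct / len(cases)
    return scores


# Stand-in "models" so the sketch runs without any inference backend.
cases = [("2+2=", "4"), ("Capital of France?", "Paris")]
models = {
    "big-model-2bit": lambda prompt: "4",
    "small-model-8bit": lambda prompt: " Paris ",
}
print(compare_models(models, cases))
```

In practice the lambdas would call whatever runtime serves each quant, and the cases would come from your own workload rather than a public benchmark.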

by Wowfunhappy 3 hours ago

> In general, quantizing down to 6 bits gives no measurable loss in performance.

...this can't be literally true or no one (including e.g. OpenAI) would use > 6 bits, right?

by lostmsu 3 hours ago

Did you run, say, SWE-bench Verified? Where is this claim coming from? It's just an urban legend.

by bertili 11 hours ago

Most certainly not, but the Unsloth MLX quant fits in 256GB.

by embedding-shape 11 hours ago

Curious what the prefill and token generation speeds are. Apple hardware already seems embarrassingly slow for the prefill step, and OK with the token generation, but that's with way smaller models (1/4 the size), so at this size? Might fit, but I'm guessing it might be anything but usable, sadly.

by regularfry 9 hours ago

They're claiming 20+ tps inference on a MacBook with the Unsloth quant.

by embedding-shape 6 hours ago

Yeah, I'm guessing Mac users still aren't very fond of sharing how long the prefill takes. They usually only share the tok/s for output, never for input.
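To show why the output number alone can mislead, here is a rough latency sketch. It assumes prefill and generation each run at a constant rate; the 100 tok/s prefill rate and the prompt length are invented for illustration, and only the 20 tok/s output rate comes from the claim above.

```python
def response_time_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (time_to_first_token, total_time) in seconds.

    Assumption: both phases run at constant throughput, which real
    hardware only approximates (prefill tends to slow on long prompts).
    """
    time_to_first_token = prompt_tokens / prefill_tps
    total = time_to_first_token + output_tokens / decode_tps
    return time_to_first_token, total


# Hypothetical long prompt: 32,000 tokens in, 500 tokens out.
ttft, total = response_time_s(
    prompt_tokens=32_000,
    output_tokens=500,
    prefill_tps=100,  # made-up prefill rate
    decode_tps=20,    # the "20+ tps" output rate claimed above
)
print(ttft, total)  # 320.0 345.0
```

Under those made-up numbers, generating the answer takes 25 seconds but the wait before the first token is over five minutes, so the input rate dominates for long prompts.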

by margorczynski 10 hours ago

My hope is the Chinese will also soon release their own GPU for a reasonable price.