Nobody runs unquantized; there's literally no reason to. Q8 is the largest anyone actually runs on consumer hardware for inference.
Halving the precision of the weights is not a free lunch...
Q8 is virtually lossless. Quantization loss becomes much more noticeable around Q4 and below. FP16->Q8 on consumer hardware is 2x the speed at ~99.99% of the quality.
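To make the Q8 vs Q4 gap concrete, here is a minimal sketch that round-trips synthetic weights through symmetric per-block quantization and compares the reconstruction error at 8 and 4 bits. Everything in it is an assumption for illustration: the Gaussian weights, the block size of 32, and the plain round-to-nearest scheme are not any particular model or file format, and weight error is only a proxy for output quality, so it neither confirms nor refutes the 99.99% figure.

```python
import random


def quantize_block(block, bits):
    """Symmetric round-to-nearest quantization of one block, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / qmax  # one scale per block (hypothetical layout)
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in block]


def relative_rms_error(weights, bits, block_size=32):
    """RMS reconstruction error divided by RMS weight magnitude."""
    err_sq = 0.0
    ref_sq = 0.0
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        deq = quantize_block(block, bits)
        err_sq += sum((a - b) ** 2 for a, b in zip(block, deq))
        ref_sq += sum(a * a for a in block)
    return (err_sq / ref_sq) ** 0.5


random.seed(0)
# Synthetic stand-in for a weight tensor; real weights are not exactly Gaussian.
weights = [random.gauss(0.0, 0.02) for _ in range(32 * 256)]

err8 = relative_rms_error(weights, 8)
err4 = relative_rms_error(weights, 4)
print(f"8-bit relative RMS error: {err8:.4%}")
print(f"4-bit relative RMS error: {err4:.4%}")
```

Each bit removed roughly doubles the quantization step, so the 4-bit error should come out many times larger than the 8-bit error. Whether that shows up in model output depends on the model and the task, which is why perplexity or benchmark comparisons are the usual evidence.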
Any source that confirms the 99.99% quality?
It also comes down to inference speed, not "can I run this". 8-bit quant is quite a bit slower on an M5 Pro.
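A back-of-envelope sketch of why bit width maps to speed: in single-stream decoding of a dense model, every weight is read roughly once per generated token, so memory bandwidth divided by weight size gives an upper bound on tokens per second. The 30B parameter count and 300 GB/s bandwidth below are placeholder numbers, not the specs of an M5 Pro or any other real machine, and the estimate ignores KV cache reads, compute limits, and dequantization overhead.

```python
def max_tokens_per_second(params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed if reading the weights is the only cost."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


# Hypothetical model size and memory bandwidth, chosen only for illustration.
PARAMS_B = 30
BANDWIDTH_GB_S = 300

fp16 = max_tokens_per_second(PARAMS_B, 16, BANDWIDTH_GB_S)
q8 = max_tokens_per_second(PARAMS_B, 8, BANDWIDTH_GB_S)
q4 = max_tokens_per_second(PARAMS_B, 4, BANDWIDTH_GB_S)

for name, tps in [("FP16", fp16), ("Q8", q8), ("Q4", q4)]:
    print(f"{name}: at most {tps:.1f} tokens/s")
```

Under these assumptions the bound scales inversely with bits per weight, which is where the "2x the speed" from FP16 to Q8 comes from, and why Q8 in turn has about half the ceiling of Q4.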