Nobody runs unquantized; there's literally no reason to. Q8 is the largest quant anyone actually runs on consumer hardware for inference.

22 hours ago

deleted

by bityard 22 hours ago

Halving the precision of the weights is not a free lunch...

by Catloafdev 20 hours ago

Q8 is virtually lossless. Quantization loss becomes much more noticeable around Q4 and below. Going from FP16 to Q8 on consumer hardware gives about 2x the speed at ~99.99% of the quality.
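For anyone who wants to poke at the Q8 vs. Q4 gap themselves, here is a rough sketch. It assumes a simple symmetric absmax scheme with blocks of 32 weights and runs on synthetic Gaussian "weights"; the function names and the block size are made up for illustration and this is not any particular runtime's quantizer. It only measures weight reconstruction error, which is not the same thing as model quality, so it neither confirms nor refutes the 99.99% figure.

```python
import math
import random


def quantize_block(block, bits):
    """Symmetric absmax quantization of one block, returned as dequantized floats."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits, 7 for 4 bits
    absmax = max(abs(w) for w in block)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round each weight to the nearest integer level, then map back to floats.
    return [round(w / scale) * scale for w in block]


def relative_rms_error(weights, bits, block_size=32):
    """RMS of (dequantized - original) divided by the RMS of the original."""
    err_sq = 0.0
    ref_sq = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        restored = quantize_block(block, bits)
        err_sq += sum((r - w) ** 2 for r, w in zip(restored, block))
        ref_sq += sum(w ** 2 for w in block)
    return math.sqrt(err_sq / ref_sq)


# Assumption: synthetic Gaussian values stand in for real model weights.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(32 * 256)]

err_q8 = relative_rms_error(weights, bits=8)
err_q4 = relative_rms_error(weights, bits=4)
print(f"8-bit relative RMS error: {err_q8:.4%}")
print(f"4-bit relative RMS error: {err_q4:.4%}")
```

To say anything about actual output quality you would have to compare perplexity or benchmark scores on a real model at each quant level, not just weight error.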

by rvba 11 hours ago

Any source that confirms the 99.99% quality?

by bitexploder 1 day ago

It also comes down to inference speed, not "can I run this". 8-bit quant is quite a bit slower on an M5 Pro.

by gchamonlive 1 day ago

[dead]