Points:
-- Q5_K_M Unsloth quantization, running under llama.cpp on Linux
-- context 81k, flash attention on, 8-bit K/V caches
-- prompt processing (pp) 625 t/s, token generation (tg) 30 t/s
Q8 with the same context wouldn't fit in 48GB of VRAM; the Q5_K_M did fit, even at 128k of context.