build/bin/llama-server \
-m ~/models/llm/qwen3.6-27b/qwen3.6-27B-q8_0.gguf \
--no-mmap \
--n-gpu-layers 99 \
--ctx-size 131072 \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--jinja \
--no-mmproj \
--parallel 1 \
--cache-ram 4096 \
-ctxcp 2 \
--reasoning-format auto \
--chat-template-kwargs '{"preserve_thinking": true}'
Should fit nicely in a single 5090 (VRAM breakdown in MiB):

self = model + context + compute
30968 = 25972 + 4501 + 495
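The context figure above depends on the model's shape. As a rough cross-check, here is a back-of-the-envelope KV-cache size calculator. The layer/head dimensions below are hypothetical placeholders, not the model's actual shape — substitute the values llama-server prints at load time. q8_0 is treated as ~8.5 bits per element (8-bit values plus per-block scales).

```python
def kv_cache_bytes(n_layers, ctx, n_kv_heads, head_dim, bits_per_elem):
    # K and V each hold ctx * n_kv_heads * head_dim elements per layer,
    # hence the factor of 2. bits_per_elem: 16 for f16, ~8.5 for q8_0.
    return int(2 * n_layers * ctx * n_kv_heads * head_dim * bits_per_elem / 8)

# Hypothetical dimensions -- replace with your model's real values.
L, H, D = 48, 8, 128

q8 = kv_cache_bytes(L, 131072, H, D, 8.5)   # q8_0 K+V at 128K context
f16 = kv_cache_bytes(L, 131072, H, D, 16.0)  # f16  K+V at 128K context
print(f"q8_0: {q8 / 2**20:.0f} MiB, f16: {f16 / 2**20:.0f} MiB")
```

The useful takeaway is the ratio: f16 KV cache costs about 1.9x as much as q8_0, so halving the context roughly pays for the precision bump.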
Even bumping up to a 16-bit K cache should fit comfortably if you drop down to 64K context, which is still a pretty decent amount. I would try both; I'm not sure how tolerant the Qwen3.5 series is of dropping the K cache to 8 bits.

You probably can, actually. Not saying it would be ideal, but it can fit entirely in VRAM (if you make sure to quantize the attention layers). KV-cache quantization and not loading the vision tower help quite a bit. Not ideal for long context, but it should be very much possible.
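For reference, the 64K-context / 16-bit-K variant would only change two flags relative to the command above (untested sketch; remaining flags as in the original invocation):

```shell
build/bin/llama-server \
  -m ~/models/llm/qwen3.6-27b/qwen3.6-27B-q8_0.gguf \
  --ctx-size 65536 \
  --cache-type-k f16 \
  --cache-type-v q8_0 \
  --flash-attn on \
  --no-mmap --jinja --no-mmproj --parallel 1
```

Keeping V at q8_0 while raising K to f16 follows the common observation that generation quality is more sensitive to K-cache precision than V-cache precision.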
I addressed the lossless claim in another reply, but I guess it really depends on what the model is used for. For my use cases, I'd say it's nearly lossless.