upvote
Qwen3.5 series is a little bit of an exception to the general rule here. It is incredibly kv cache size efficient. I think the max context (262k) fits in 3GB at q8 iirc. I prefer to keep the cache at full precision though.
reply
I just tested it and have to make a correction. With llama.cpp, 262144 tokens context (Q8 cache) used 8.7 GB memory with Qwen3.6 27B. Still very impressive.
reply
The MoE variants are more cache efficient. Here from Qwen3.6 35B A3B MoE with 256k (262144) context at full F16 (so no cache quality loss):

  llama_kv_cache: size = 5120.00 MiB (262144 cells,  10 layers,  4/1 seqs), K (f16): 2560.00 MiB, V (f16): 2560.00 MiB
The MXFP4-quantized variant from Unsloth just fits my 5090 with 32GB VRAM at 256k context.

Meanwhile here's for Qwen 3.6 27B:

  llama_kv_cache: size = 3072.00 MiB ( 49152 cells,  16 layers,  4/1 seqs), K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
So 16 tokens per MiB for the 27B model vs about 51 tokens per MiB for the 35B MoE model.

I went for the Q5 UD variant for 27B so could just fit 48k context, though it seems if I went for the Q4 UD variant I could get 64k context.

That said I haven't tried the Qwen3.6 35B MoE to figure out if it can effectively use the full 256k context, that varies from model to model depending on the model training.

reply
I found the KLD benchmark image at the bottom of https://unsloth.ai/docs/models/qwen3.6 to be very helpful when choosing a quant.
reply
Yea, I'm also kind of jealous of Apple folks with their unified RAM. On a traditional homelab setup with gobs of system RAM and a GPU with relatively little VRAM, all that system RAM sits there useless for running LLMs.
reply
That "traditional" setup is the recommended setup for running large MoE models, leaving shared routing layers on the GPU to the extent feasible. You can even go larger-than-system-RAM via mmap, though at a non-trivial cost in throughput.
reply
deleted
reply
Strix Halo is another option
reply
qwen3.5 27b w/ 4bit quant works reasonably on a 3090.
reply