I was hoping this would be the model to replace our Qwen3.5-27B, but the difference is marginal. Too risky; I'll pass and wait for the release of a dense version.
Could you explain why prompt processing is the bottleneck, please? I've seen this behavior but I don't understand why.
You should be able to save a lot on prefill by stashing shared KV-cache prefixes (since the KV-cache for plain transformers is an append-only structure) to near-line bulk storage and fetching them back in as needed. Not sure why local AI engines don't do this already, since it's a natural extension of session save/restore and of what's usually called prompt caching.
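To make the idea concrete, here's a minimal sketch of a prefix-keyed KV store (names and the flat KV representation are hypothetical, not any engine's actual API). Because the cache is append-only, the KV entries for `tokens[:n]` are exactly a prefix of the entries for the full prompt, so prefill only needs to resume from the longest stashed prefix:

```python
import hashlib
from typing import Dict, List, Tuple

# Hypothetical flattened KV type: one (key, value) entry per cached token.
KV = List[Tuple[float, float]]

class PrefixKVCache:
    """Sketch of prefix stashing: store KV entries keyed by a hash of the
    exact token prefix they were computed for, and on a new prompt look
    up the longest stored prefix so prefill can skip those tokens."""

    def __init__(self) -> None:
        self._store: Dict[str, KV] = {}  # prefix hash -> cached KV entries

    @staticmethod
    def _key(tokens: List[int]) -> str:
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def put(self, tokens: List[int], kv: KV) -> None:
        # Stash the KV entries computed for this exact token prefix.
        self._store[self._key(tokens)] = kv

    def longest_prefix(self, tokens: List[int]) -> Tuple[int, KV]:
        # Linear scan from longest to shortest for clarity; a real engine
        # would use a trie or fixed-size block hashing instead.
        for n in range(len(tokens), 0, -1):
            kv = self._store.get(self._key(tokens[:n]))
            if kv is not None:
                return n, kv
        return 0, []
```

So if a 3-token system prompt is stashed, a later prompt sharing it only pays prefill for the new tokens; swapping the dict for on-disk storage gives the "near-line bulk storage" version.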
If I understand you correctly, this is essentially what vLLM does with their paged cache; if I've misunderstood, I apologize.
PagedAttention is more of a low-level building block, aimed initially at avoiding duplication of shared KV-cache prefixes in large-batch inference. But you're right that it's quite related. The llama.cpp folks are still thinking about it, per a recent discussion from that project: https://github.com/ggml-org/llama.cpp/discussions/21961
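The "avoiding duplication" part can be sketched roughly like this (a toy block-table allocator in the spirit of PagedAttention, not vLLM's actual code; the class and method names are made up): sequences map to lists of fixed-size KV blocks, and a forked sequence reuses the parent's blocks with refcounts instead of copying them.

```python
from typing import Dict, List

BLOCK = 16  # tokens per KV block (a common fixed block size)

class PagedKV:
    """Toy block-table allocator: each sequence's KV-cache lives in
    fixed-size physical blocks, and sequences sharing a prompt prefix
    point at the same blocks (tracked by refcount) rather than
    duplicating them."""

    def __init__(self) -> None:
        self.refcount: Dict[int, int] = {}       # block id -> ref count
        self.tables: Dict[str, List[int]] = {}   # seq id -> block ids
        self._next_block = 0

    def alloc(self, seq: str, n_tokens: int) -> List[int]:
        # Allocate enough fresh blocks to hold n_tokens of KV entries.
        blocks = []
        for _ in range((n_tokens + BLOCK - 1) // BLOCK):
            b = self._next_block
            self._next_block += 1
            self.refcount[b] = 1
            blocks.append(b)
        self.tables[seq] = blocks
        return blocks

    def fork(self, parent: str, child: str) -> None:
        # Copy-on-write sharing: the child reuses the parent's physical
        # blocks; only the block table is copied, not the KV data.
        blocks = self.tables[parent]
        for b in blocks:
            self.refcount[b] += 1
        self.tables[child] = list(blocks)
```

With batch sampling of many completions from one prompt, every child sequence shares the prompt's blocks, which is where the large-batch memory savings come from.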