
I have a 5070 12 GB laptop GPU and can hit 72 tokens per second in the first couple thousand tokens before dropping to mid-high 50s after about 15k context.

This setup is extremely optimized down to the last flag. Changing any param above the temp flag craters performance.

I don't have enough system RAM to properly handle the large context windows so I don't use local models.

  # 1,257 tokens 17s 72.18 t/s

  $env:CUDA_DEVICE_SCHEDULE = "SPIN"
  cd D:\src\llama.cpp\
  .\build\bin\Release\llama-server.exe `
    --port 8080 `
    --host 127.0.0.1 `
    -m "D:\LLM\Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf" `
    -fitt 2048 `
    -c 98304 `
    -n 32768 `
    -fa on `
    -np 1 `
    --kv-unified `
    -ctk q8_0 `
    -ctv q8_0 `
    -ctkd q8_0 `
    -ctvd q8_0 `
    -ctxcp 64 `
    --mlock `
    --no-warmup `
    --spec-type draft-mtp `
    --spec-draft-n-max 2 `
    --spec-draft-p-min 0.1 `
    --chat-template-kwargs '{\"preserve_thinking\": true}' `
    --temp 0.6 `
    --top-p 0.95 `
    --top-k 20 `
    --min-p 0.0 `
    --presence-penalty 0.0 `
    --repeat-penalty 1.0
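The header comment in the command above can be sanity-checked with a little arithmetic. This is only a sketch: the function names are made up for illustration, and 57 t/s is a stand-in for "mid-high 50s", not a measured figure.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average generation speed over one response."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


def seconds_for(tokens: int, rate: float) -> float:
    """Time needed to generate `tokens` at `rate` tokens per second."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return tokens / rate


def slowdown_percent(fast: float, slow: float) -> float:
    """Relative drop from `fast` to `slow`, in percent."""
    if fast <= 0:
        raise ValueError("fast must be positive")
    return (fast - slow) / fast * 100.0


# Figures from the post: 1,257 tokens at 72.18 t/s.
elapsed = seconds_for(1257, 72.18)   # about 17.4 s, which the post rounds to 17s

# Assumption: 57 t/s stands in for "mid-high 50s" after ~15k context.
drop = slowdown_percent(72.0, 57.0)  # roughly a 21% slowdown
```

So the "17s" in the comment is a rounded value, and the long-context slowdown described is on the order of a fifth of the initial speed.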

by themanualstates, 2 hours ago

That’s useless without describing WHY you chose those flags, and how you did the optimisation…

by halJordan, 51 minutes ago

The switches are all in the -h of llama.cpp (although the maintainers have a tendency to use the word in its definition). The actual values are essentially just what Alibaba recommends, so you just need their model card. I would not call it highly optimized; more appropriately, tuned.

by nateb2022, 2 hours ago

I get over 100 tok/s sustained on my M4 Max and M5 Max, in MacBook Pros. LM Studio + MLX.

by Terretta, 1 hour ago

With Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf?

Also, funny to lump the M4 "and" the M5 together; I see a 15% to 45% performance difference between them, depending.

And for a good deal of work, an M3 Studio Ultra outpaces the M4 and ties the M5 when running one job at a time, and outpaces both when running several jobs at once.

by ridiculous_leke, 1 hour ago

Can you comment on the quality and accuracy of it? People have managed to run Gemma 26b without GPU on old CPUs but I don't think quality is anywhere close to what Gemma 12b offers.

by mattmanser, 2 hours ago

That's a 4-bit quant, which the thread OP specifically called out as rubbish.

That's the Q4_K_XL bit of the filename, for those not in the know.

by stymaar, 1 hour ago

Anyone calling Qwen3.6-35B-A3B-Q4_K_XL “rubbish” has no idea what they are talking about.

by embedding-shape, 1 hour ago

I'd agree that the quality degrades a lot between Q8 and Q4, to the point of being borderline unusable, as models start to fail even at tool-calling syntax. Personally I'd say Q8 is as low as you want to go.

by c0rruptbytes, 1 hour ago

q4 isn't rubbish, but it's a compromise for good value; q6 is essentially a no-compromise quantization, and in my experience it's what i recommend for MoEs in agentic workflows
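To put rough numbers on the q4 / q6 / q8 trade-off being argued here, a back-of-the-envelope size estimate helps. This is a sketch under loud assumptions: the parameter count is simply read off the "35B" in the model name, and every tensor is treated as if it used one block format, which mixed quants such as the "_XL" variants do not do, so real file sizes will differ.

```python
# Bits per weight for three llama.cpp block formats, derived from their
# block layouts: bytes per block * 8 / weights per block.
BITS_PER_WEIGHT = {
    "Q4_K": 144 * 8 / 256,  # 4.5 bits per weight
    "Q6_K": 210 * 8 / 256,  # 6.5625 bits per weight
    "Q8_0": 34 * 8 / 32,    # 8.5 bits per weight
}


def weights_gib(n_params: float, quant: str) -> float:
    """Approximate weight storage in GiB if every tensor used `quant`."""
    total_bits = n_params * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 2**30


# Assumption: 35e9 parameters, taken from the "35B" in the model name.
N_PARAMS = 35e9

sizes = {q: round(weights_gib(N_PARAMS, q), 1) for q in BITS_PER_WEIGHT}
for quant, gib in sizes.items():
    print(f"{quant}: ~{gib} GiB")
```

Under these assumptions, even the 4-bit variant is larger than 12 GB of VRAM, which is consistent with the original poster leaning on system RAM as well.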

by greenavocado, 1 hour ago

He's probably calling me out for this comment https://news.ycombinator.com/item?id=48557579

by greenavocado, 1 hour ago

I typically find myself using a context of 150-500k with GPT models, so local models are simply not enough and I stopped using them.

by stymaar, 1 hour ago

That's way higher than their optimal ceiling (and absolutely suboptimal from a token cost point of view). Why are you doing that?

by greenavocado, 1 hour ago

You're 100% right, and it's even more severe than that: I daily drive on xhigh. I really try to avoid it, but when reconciling APIs across two large codebases you really start pressing north of 200k. I find myself topping out at 800k sometimes, and that's with careful context management. I actually had to drop to GPT 5.4 for 1M context in my subscription because GPT 5.5 tops out at 272k. Hitting 800k context is better than repeatedly hitting, let's say, 200k out of 272k with multiple rounds of compaction. I run Can's snapcompact, and while it's better than normal compaction, it still lobotomizes the model more than running with a very high context window.
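The trade-off described here, one very large window versus repeated compaction in a smaller one, can be illustrated with a toy model. Everything in it is hypothetical: the function name, the 25% keep ratio, and the fixed 1,000-token steps are assumptions for illustration, not how snapcompact or any real compaction tool behaves.

```python
def count_compactions(total_tokens: int, window: int,
                      keep_ratio: float = 0.25, step: int = 1000) -> int:
    """Count compaction rounds needed to feed `total_tokens` through `window`.

    Toy model: tokens arrive in chunks of `step`. When the next chunk would
    overflow the window, the context is compacted to `keep_ratio` of its
    current size (an assumed figure) before the chunk is appended.
    """
    if not 0 <= keep_ratio < 1:
        raise ValueError("keep_ratio must be in [0, 1)")
    if step <= 0 or window < step:
        raise ValueError("need 0 < step <= window")

    used = 0
    rounds = 0
    remaining = total_tokens
    while remaining > 0:
        chunk = min(step, remaining)
        if used + chunk > window:
            used = int(used * keep_ratio)  # lossy summary of prior context
            rounds += 1
        used += chunk
        remaining -= chunk
    return rounds


# The 800k workload from the comment, against the two window sizes it names.
small_window_rounds = count_compactions(800_000, 272_000)
large_window_rounds = count_compactions(800_000, 1_000_000)
```

Each round in the smaller window is a point where earlier context gets summarized and detail can be lost, which is the cost the commenter is trying to avoid by paying for the larger window.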

by c0rruptbytes, 1 hour ago

large contexts degrade the performance - attention doesn't work well for large windows like that and cloud models are kind of hacking it

local models do involve some context engineering to get it okay, but it's not that rough