Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32109 MiB
| model | size | params | backend | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | -------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d512 | 3714.02 ± 10.85 |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d1024 | 3684.86 ± 15.21 |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d2048 | 3650.80 ± 8.53 |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d8192 | 3473.88 ± 0.97 |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d32768 | 2754.69 ± 4.07 |
ggml_metal_device_init: GPU name: MTL0 (Apple M2 Ultra)
| model | size | params | backend | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | -------- | -: | --------------: | -------------------: |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d512 | 379.75 ± 0.21 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d1024 | 377.15 ± 0.35 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d2048 | 371.46 ± 0.91 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d8192 | 344.84 ± 0.41 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d32768 | 222.42 ± 5.29 |
Btw, based on your numbers, I think our use cases are quite different. I use the agent for very targeted sessions: basically things that I already know how to do and just want to automate. My workflow is usually: new session -> read this, this and this -> do that. I.e. I don't let it wander the codebase at all, so I rarely exceed the context window.

Also, I get a lot of mileage from the ngram-based speculative decoding functionality [0], as it allows me to iterate on the implementation much faster.
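For anyone unfamiliar with the idea, here is a rough sketch of how ngram-based drafting can work. It is not the llama.cpp implementation. The function names (`ngram_draft`, `accepted_prefix`) and parameters (`n`, `max_draft`) are made up for illustration. The assumption is that when the last few tokens already appeared earlier in the context, the tokens that followed them are a cheap guess for what comes next, and the model then verifies that guess in a single batch.

```python
from typing import List, Sequence


def ngram_draft(tokens: Sequence[int], n: int = 3, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by finding the last n tokens earlier in the context.

    Hypothetical helper for illustration only.
    """
    if len(tokens) <= n:
        return []
    key = tuple(tokens[-n:])
    # Scan backwards so the most recent earlier occurrence wins.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tuple(tokens[start:start + n]) == key:
            # Whatever followed the earlier occurrence becomes the draft.
            return list(tokens[start + n:start + n + max_draft])
    return []


def accepted_prefix(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Count the leading draft tokens that match the model's own tokens."""
    count = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        count += 1
    return count


# Toy context: the sequence 1, 2, 3 repeats, so 4, 5, 6 is drafted.
context = [1, 2, 3, 4, 5, 6, 1, 2, 3]
draft = ngram_draft(context, n=3, max_draft=3)
# Pretend the model would have produced 4, 5, 9 at those positions.
kept = accepted_prefix(draft, [4, 5, 9])
```

This tends to pay off in edit-heavy sessions, where the model re-emits large chunks of code that are already in the context, so long drafts get accepted.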
I do use it the way you're describing on personal projects at home, though in a very crude manner: pasting code snippets into the llama server web UI prompt. Next I'll try OpenCode.
At work I use it in a similar manner with more mature tools, but the vast majority of token spend comes from a totally different workflow: "pretend the AI is a fleet of junior/intern engineers you're delegating work to", where the agent does the implementation on its own, commits the changes, etc.
That workflow does indeed spend a lot of tokens wandering the codebase, talking to MCPs, loading skills, etc.