Gerganov, hope you will consider developing further the CLI cause we suffering with the server.
About the harness depends on for what you need it, but basically for a universal unit of measure, Harness is multilayered and logic and domain specific dependent. I would definitely include Type of Hardware, Model parameters/knowledge, Model Intelligence, Model size/context, type of conversion, type and quantization (models comes with some default tools), but adding your (domain specific), skills, tools, memory, logs, security, Rag, Online search... (which as scary as they sound are mostly simple logics in a txt file, like if this do that).
The full pack is Harness 10, every missing thing lower the harness score.
To answer to your question I would definitely recommend smth like HugstonOne (or anyway llama.cpp CLI) with Qwen 3.6 35B finetuned/distill (deepseek 4 or claude 4.7) with none of the current coding agents out there that are screaming internet connection and proprietary API and data collection. DO this, if you can find a tool that you can download and choose a local model (of your choice in whatever folder locally) and load it ready for inference without any need of internet connection that is the tool you should aim for. Right now there is none out there.
Curious if you can share the prefill speed too?
I run locally on a crappy desktop (some AMD iGPU with Vulkan llama.cpp, 32 GB DDR4 RAM) for experimentation. I get 15 tok/s on generation for the qwen & gemma4 MoE models. I get around 150 tok/s prefill speed.
Reason I'm asking about the prefill is looking at my stats at work, I use between 20M to peaks of 300M input tokens daily. Some of those token are cached but in general, I seem to have roughly 500x more input tokens than output. So interested in prefill tok/s stats.
Huge Thank you for llama.cpp btw!!
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32109 MiB
| model | size | params | backend | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | -------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d512 | 3714.02 ± 10.85 |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d1024 | 3684.86 ± 15.21 |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d2048 | 3650.80 ± 8.53 |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d8192 | 3473.88 ± 0.97 |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d32768 | 2754.69 ± 4.07 |
ggml_metal_device_init: GPU name: MTL0 (Apple M2 Ultra)
| model | size | params | backend | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | -------- | -: | --------------: | -------------------: |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d512 | 379.75 ± 0.21 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d1024 | 377.15 ± 0.35 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d2048 | 371.46 ± 0.91 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d8192 | 344.84 ± 0.41 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d32768 | 222.42 ± 5.29 |
Btw, based on your numbers, I think our use cases are quite different. I use the agent for very targeted sessions - basically things that are clear to me how to do, just want to automate them. My workflow is usually: new session -> read this, this and this -> do that. I.e. I don't let it wander at all in the codebase, so I rarely exceed the context window.Also, I get a lot of mileage from the ngram-based speculative decoding functionality [0] as it allows me to iterate on the implementation much faster.
I do use it the same way as you're describing on personal projects at home, in a very crude manner (pasting code snippets in llama server web UI prompt. Next will attempt OpenCode)
At work I use it in similar manner with more mature tools, but the vast majority of token spend comes from a totally different workflow: "pretend the AI is a fleet of junior/intern engineer you're delegating work to", where the agent will on its own do the implementation, commit the changes etc.
It does indeed spend a lot of tokens wandering the codebase, talking to MCPs, loading skills etc.
[0] https://huggingface.co/ggerganov/presets/blob/main/preset.in...