I can also confirm that local inference is on par with proprietary cloud services (with a bit of local setup, a simple agents.md and some utility skills). These local models come with tools, which is mind-blowing considering that a few months ago we had to write the tools in .md files ourselves. What makes a model worth even more is "memory". We implemented that long ago. The last time I used proprietary services was 3 months ago; I don't really need them, and my subscription is lapsing.

Gerganov, I hope you will consider developing the CLI further, because we are suffering with the server.

by jayGlow 1 hour ago

What are you using for memory with your local models? Is there a specific harness you would recommend for local agents?

by trilogic 35 minutes ago

I use HugstonOne (its backend is a customized version of llama.cpp). It implements its own double-layer memory. One layer recalls the full or partial previous session/file, with an ON/OFF switch (so it picks up where you left off in the CLI, the server, or both at the same time). The other, used when the memory switch is off, reads back a percentage of the current tab: it takes checkpoints every certain number of tokens, summarizes them, and refers back to them when needed (recalled by certain logic). There is more to it when local RAG is involved (making it a triple memory layer), but that's a long story.
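A minimal sketch of the checkpoint-and-summarize idea described above, not HugstonOne's actual implementation. All names here (`CheckpointMemory`, `checkpoint_every`, `naive_summary`) are hypothetical, token counting is approximated by whitespace splitting, and the summarizer is a stand-in truncation where a real setup would presumably call the local model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def naive_summary(text: str, max_words: int = 12) -> str:
    """Stand-in summarizer: keep the first few words (a real one would call the model)."""
    return " ".join(text.split()[:max_words])


@dataclass
class CheckpointMemory:
    """Hypothetical rolling memory: summarize the transcript every N tokens."""
    checkpoint_every: int = 50                       # assumed token budget per checkpoint
    summarize: Callable[[str], str] = naive_summary
    checkpoints: List[str] = field(default_factory=list)
    _pending: List[str] = field(default_factory=list)
    _pending_tokens: int = 0

    def add(self, message: str) -> None:
        """Record a message; emit a checkpoint once enough tokens have accumulated."""
        self._pending.append(message)
        self._pending_tokens += len(message.split())  # crude whitespace token count
        if self._pending_tokens >= self.checkpoint_every:
            self.checkpoints.append(self.summarize(" ".join(self._pending)))
            self._pending.clear()
            self._pending_tokens = 0

    def recall(self, last_n: int = 3) -> str:
        """Return the most recent summaries, to be prepended to the next prompt."""
        return "\n".join(self.checkpoints[-last_n:])
```

In use, each turn would go through `add()`, and `recall()` would supply the text to put in front of the next prompt so the model can refer back to earlier parts of the session.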

About the harness: it depends on what you need it for, but as a universal unit of measure, a harness is multilayered and depends on the logic and the domain. I would definitely include type of hardware, model parameters/knowledge, model intelligence, model size/context, type of conversion, and type of quantization (models come with some default tools), plus your own domain-specific skills, tools, memory, logs, security, RAG, online search... (which, as scary as they sound, are mostly simple logic in a txt file, like "if this, do that").

The full pack is Harness 10; every missing item lowers the harness score.

To answer your question, I would definitely recommend something like HugstonOne (or at any rate the llama.cpp CLI) with a Qwen 3.6 35B finetune/distill (deepseek 4 or claude 4.7), and none of the current coding agents out there that are screaming for an internet connection, a proprietary API and data collection. Do this: if you can find a tool that you can download, point at a local model of your choice in whatever local folder, and load ready for inference without any need for an internet connection, that is the tool you should aim for. Right now there is none out there.

by kpw94 2 hours ago

> About the generation speed: ~100-150 t/s on the RTX 5090 and ~40 t/s on the Mac

Curious if you can share the prefill speed too?

I run locally on a crappy desktop (some AMD iGPU with Vulkan llama.cpp, 32 GB DDR4 RAM) for experimentation. I get 15 tok/s on generation for the qwen & gemma4 MoE models. I get around 150 tok/s prefill speed.

The reason I'm asking about prefill: looking at my stats at work, I use between 20M and peaks of 300M input tokens daily. Some of those tokens are cached, but in general I seem to have roughly 500x more input tokens than output. So I'm interested in prefill tok/s stats.

Huge thank you for llama.cpp btw!!
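To make the point concrete, here is a rough back-of-the-envelope sketch using only the numbers in this comment (150 tok/s prefill, 15 tok/s generation, about 500 input tokens per output token). It ignores caching and batching, so treat it as an estimate; the function name is just illustrative.

```python
def prefill_time_share(prefill_tps: float, gen_tps: float, input_per_output: float) -> float:
    """Fraction of wall-clock time spent on prefill, ignoring caching."""
    prefill_seconds = input_per_output / prefill_tps   # time to ingest the inputs
    gen_seconds = 1.0 / gen_tps                        # time to emit one output token
    return prefill_seconds / (prefill_seconds + gen_seconds)


share = prefill_time_share(prefill_tps=150, gen_tps=15, input_per_output=500)
print(f"prefill share of total time: {share:.1%}")
```

With those inputs, about 98% of the time goes to prefill, which is why generation speed alone says little about this kind of workload.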

by ggerganov 2 hours ago

Here are the prefill speeds:

  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32109 MiB
  | model                          |       size |     params | backend  |  fa |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | -------- | --: | --------------: | -------------------: |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 |   pp2048 @ d512 |      3714.02 ± 10.85 |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 |  pp2048 @ d1024 |      3684.86 ± 15.21 |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 |  pp2048 @ d2048 |       3650.80 ± 8.53 |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 |  pp2048 @ d8192 |       3473.88 ± 0.97 |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 | pp2048 @ d32768 |       2754.69 ± 4.07 |

  ggml_metal_device_init: GPU name:   MTL0 (Apple M2 Ultra)
  | model                          |       size |     params | backend  | fa |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | -------- | -: | --------------: | -------------------: |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 |   pp2048 @ d512 |        379.75 ± 0.21 |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 |  pp2048 @ d1024 |        377.15 ± 0.35 |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 |  pp2048 @ d2048 |        371.46 ± 0.91 |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 |  pp2048 @ d8192 |        344.84 ± 0.41 |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 | pp2048 @ d32768 |        222.42 ± 5.29 |
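A small sketch for reading the two tables above: it computes how much prefill throughput drops between depth 512 and depth 32768, and how long a given number of input tokens would take at a given rate. The rates are copied from the tables; the 20M-token figure comes from the question upthread, and the helper names are made up for illustration. It assumes a single stream at constant speed with no caching.

```python
# pp2048 throughput (t/s) at depth 512 and depth 32768, copied from the tables above
RATES = {
    "RTX 5090 (Q4_K - Medium)": (3714.02, 2754.69),
    "M2 Ultra (Q8_0)": (379.75, 222.42),
}


def slowdown_pct(shallow: float, deep: float) -> float:
    """Percentage drop in throughput going from the shallow to the deep context."""
    return (1.0 - deep / shallow) * 100.0


def hours_to_prefill(tokens: float, tokens_per_second: float) -> float:
    """Hours needed to ingest `tokens` at a constant rate (no caching, one stream)."""
    return tokens / tokens_per_second / 3600.0


for device, (shallow, deep) in RATES.items():
    print(f"{device}: {slowdown_pct(shallow, deep):.1f}% slower at d32768, "
          f"{hours_to_prefill(20e6, shallow):.1f} h for 20M tokens at the d512 rate")
```

The drop works out to roughly 26% on the RTX 5090 and roughly 41% on the M2 Ultra.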

Btw, based on your numbers, I think our use cases are quite different. I use the agent for very targeted sessions - basically things that are clear to me how to do, just want to automate them. My workflow is usually: new session -> read this, this and this -> do that. I.e. I don't let it wander at all in the codebase, so I rarely exceed the context window.

Also, I get a lot of mileage from the ngram-based speculative decoding functionality [0] as it allows me to iterate on the implementation much faster.

[0] https://github.com/ggml-org/llama.cpp/pull/19164
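For readers unfamiliar with the idea, here is a toy sketch of n-gram lookup drafting, the general technique behind ngram-based speculative decoding: find the most recent earlier occurrence of the last few tokens in the context and propose whatever followed it as a draft, which the main model then verifies. This is an illustration under my own assumptions, not the code from the linked PR; `propose_draft`, `accept_prefix`, `ngram_size` and `max_draft` are hypothetical names.

```python
from typing import List, Sequence


def propose_draft(tokens: Sequence[int], ngram_size: int = 3, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in the context."""
    if len(tokens) <= ngram_size:
        return []
    tail = list(tokens[-ngram_size:])
    # Scan backwards so the most recent earlier match wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            return list(tokens[follow:follow + max_draft])
    return []


def accept_prefix(draft: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Keep draft tokens only up to the first disagreement with the model's own picks."""
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    return accepted
```

This kind of drafting tends to pay off when the output repeats spans of the context, as it often does when editing existing code, since every accepted draft token saves a separate decoding step.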

by kpw94 1 hour ago

Thanks! Super helpful.

I do use it the same way you're describing on personal projects at home, in a very crude manner (pasting code snippets into the llama server web UI prompt; next I'll attempt OpenCode).

At work I use it in a similar manner with more mature tools, but the vast majority of token spend comes from a totally different workflow: "pretend the AI is a fleet of junior/intern engineers you're delegating work to", where the agent will on its own do the implementation, commit the changes etc.

It does indeed spend a lot of tokens wandering the codebase, talking to MCPs, loading skills etc.


by toddmorey 52 minutes ago

For the curious, it looks like a PC with an RTX 5090 32GB graphics card will run you about $6,000.

by celrod 3 hours ago

What quant do you run it at? 32GB seems like cutting it close on the RTX 5090 if going 8-bit, but other commenters are saying 4-bit lobotomizes the model.

by ggerganov 2 hours ago

As a baseline, I run all models in Q8 [0] because I want to be confident that when I observe a problem, the root cause is not due to the quantization. However, in this specific case, I use Q8 on the mac and Q4 on the RTX machine because the latter does not fit the full context at Q8. So far, I don't have conclusive evidence that the Q4 quantization affects the quality in a significant way for this model and the tasks that I am using it for.

[0] https://huggingface.co/ggerganov/presets/blob/main/preset.in...
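A rough sketch of the arithmetic behind this, using the sizes and VRAM figure from the benchmark output earlier in the thread (15.92 GiB at Q4_K - Medium, 26.62 GiB at Q8_0, 32109 MiB of VRAM). It ignores compute buffers and other overhead, so the real headroom for the KV cache is smaller; the function names are illustrative only.

```python
GIB = 1024 ** 3


def headroom_gib(vram_mib: float, model_gib: float) -> float:
    """VRAM left for KV cache and buffers after loading the weights, in GiB."""
    return vram_mib / 1024.0 - model_gib


def bits_per_weight(model_gib: float, params_billion: float) -> float:
    """Average bits stored per parameter for a given file size."""
    return model_gib * GIB * 8 / (params_billion * 1e9)


for name, size_gib, params_b in [("Q4_K - Medium", 15.92, 27.32), ("Q8_0", 26.62, 26.90)]:
    print(f"{name}: {bits_per_weight(size_gib, params_b):.2f} bits/weight, "
          f"{headroom_gib(32109, size_gib):.2f} GiB left on the RTX 5090")
```

With these figures, Q8_0 would leave under 5 GiB for the context on the 32 GB card, versus roughly 15 GiB at Q4, which is consistent with the Q8 model not fitting the full context there.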

by fridder 2 hours ago

Not too shabby. I like the regular Qwen, but prompt prefill on my M1 Max is slow as hell.