I tried Nemotron 3 Nano with OpenCode, and while it kinda worked, its tool use was seriously lacking because it leans on the shell tool for most things. For example, instead of using a dedicated tool to edit a file, it would just call the shell tool and run sed on it.
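To make that concrete, here's a rough sketch of the two shapes of tool call; the `shell`/`edit` names and schemas are illustrative placeholders, not OpenCode's exact API:

```python
# Illustrative tool-call payloads only; real harness schemas differ.

# What the model tends to emit: everything funneled through the shell.
shell_style_call = {
    "tool": "shell",
    "arguments": {"command": "sed -i 's/old_name/new_name/g' src/main.py"},
}

# What the harness actually wants: a structured edit it can diff,
# validate, and show to the user before applying.
edit_style_call = {
    "tool": "edit",
    "arguments": {
        "path": "src/main.py",
        "old_string": "old_name",
        "new_string": "new_name",
    },
}
```

The sed version "works", but the harness loses the ability to preview or sanity-check the change before it hits the file.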
That's the primary issue I've noticed with the agentic open-weight models in my limited testing: they seem hesitant to call tools they don't recognize unless explicitly instructed to do so.
Did anyone do this kind of math?
However, if you check the prices on Chinese models (which are the only ones you would be able to run on a Mac), they are much cheaper than the US plans. It would take you forever to get to the $10k.
And of course this is not even considering the energy costs of running inference on your own hardware (though Macs should be quite efficient there).
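A quick back-of-the-envelope version of that math. Every number below is an assumption to swap with your own spend and hardware price; the point is just how long the payback period gets when the API is cheap:

```python
# Back-of-envelope breakeven; all figures are hypothetical placeholders.
mac_cost_usd = 10_000        # e.g. a high-RAM Mac Studio
monthly_api_spend_usd = 50   # assumed spend on cheap Chinese-model APIs
monthly_energy_usd = 10      # assumed cost of powering the Mac yourself

# Months until owning the Mac beats just paying for the API.
breakeven_months = mac_cost_usd / (monthly_api_spend_usd - monthly_energy_usd)
print(f"breakeven: {breakeven_months:.0f} months "
      f"(~{breakeven_months / 12:.0f} years)")
# -> breakeven: 250 months (~21 years)
```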
It's my layman understanding that this would have to be fixed in the model weights themselves?
It can also be a bug in the model weights, with the model simply failing to generate the appropriate "I'm done thinking" indicator.
You can see this described in this PR https://github.com/ggml-org/llama.cpp/pull/19635
Apparently Step 3.5 Flash uses an odd format for its tags, so llama.cpp just doesn't handle it correctly.
It is a bug in the model weights and reproducible in their official chat UI. More details here: https://github.com/ggml-org/llama.cpp/pull/19283#issuecommen...
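For context on what "failing to generate the indicator" looks like mechanically: the server splits the output on a closing reasoning tag, and if the model never emits one (or emits an unexpected variant, as with Step 3.5 Flash), everything stays bucketed as thinking. A minimal sketch, assuming the common `<think>...</think>` convention rather than whatever tags this model actually uses:

```python
def split_reasoning(text: str,
                    open_tag: str = "<think>",
                    close_tag: str = "</think>") -> tuple[str, str]:
    """Separate chain-of-thought from the final answer.

    If the model never emits close_tag (or emits a tag the parser
    does not recognize), everything stays classified as reasoning
    and the client sees an empty response.
    """
    if open_tag in text and close_tag in text:
        start = text.index(open_tag) + len(open_tag)
        end = text.index(close_tag)
        return text[start:end], text[end + len(close_tag):]
    # No recognizable close tag: the whole output is "still thinking".
    return text, ""

reasoning, answer = split_reasoning("<think>hmm, the user wants...")
print(repr(answer))  # -> '' — looks like an endless thinking loop
```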
- OpenAI completions endpoint
- Anthropic messages endpoint
- OpenAI responses endpoint
- A slick-looking web UI
Without having to install anything else.
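For example, assuming this is llama-server, the completions endpoint speaks the standard OpenAI dialect, so the stock `openai` Python client works against it (default port 8080; the model name is just whatever GGUF you loaded):

```python
from openai import OpenAI

# llama-server listens on port 8080 by default; the api_key is ignored
# by the server but the client library requires one.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="whatever-you-loaded",  # llama-server serves the loaded model
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(resp.choices[0].message.content)
```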
For example, when I tried gpt-oss 120b with codex, it would very easily forget something present in the system prompt: "use `rg` command to search and list files".
I feel like gpt-oss has a lot of potential for agentic coding, but it needs to be constantly reminded of what is happening. Maybe a custom harness developed specifically for gpt-oss could make both gpt-oss models viable for long agentic coding sessions.
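The dumbest possible version of that harness idea is re-injecting the standing instructions every turn instead of trusting the model to keep attending to the system prompt. A sketch of the shape of the workaround (not how codex actually works; all names here are made up):

```python
# Hypothetical "keep reminding it" harness helper.
STANDING_ORDERS = "Reminder: use the `rg` command to search and list files."

def build_messages(system_prompt: str, history: list[dict]) -> list[dict]:
    """Rebuild the context each turn, re-appending the standing orders
    at the end, where the model is most likely to attend to them."""
    return (
        [{"role": "system", "content": system_prompt}]
        + history
        + [{"role": "user", "content": STANDING_ORDERS}]
    )
```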