I tried Nemotron 3 Nano with OpenCode, and while it kinda worked, its tool use was seriously lacking because it leans on the shell tool for most things. For example, instead of using a dedicated tool to edit a file, it would just call the shell tool and run sed on it.
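To make that concrete, here's a rough sketch of the two shapes of tool call; the `shell`/`edit` names and schemas are illustrative placeholders, not OpenCode's exact API:

```python
# Illustrative tool-call payloads only; real harness schemas differ.

# What the model tends to emit: everything funneled through the shell.
shell_style_call = {
    "tool": "shell",
    "arguments": {"command": "sed -i 's/old_name/new_name/g' src/main.py"},
}

# What the harness actually wants: a structured edit it can diff,
# validate, and show to the user before applying.
edit_style_call = {
    "tool": "edit",
    "arguments": {
        "path": "src/main.py",
        "old_string": "old_name",
        "new_string": "new_name",
    },
}
```

The sed version "works", but the harness loses the ability to preview or sanity-check the change before it hits the file.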
That's the primary issue I've noticed with the agentic open-weight models in my limited testing: they seem hesitant to call tools they don't recognize unless explicitly instructed to do so.
Did anyone do this kind of math?
However, if you check the prices on Chinese models (which are the only ones you would be able to run on a Mac), they are much cheaper than the US plans. It would take you forever to get to the $10k.
And of course this is not even considering the energy costs of running inference on your own hardware (though Macs should be quite efficient there).
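A quick back-of-the-envelope version of that math. Every number below is an assumption to swap with your own spend and hardware price; the point is just how long the payback period gets when the API is cheap:

```python
# Back-of-envelope breakeven; all figures are hypothetical placeholders.
mac_cost_usd = 10_000        # e.g. a high-RAM Mac Studio
monthly_api_spend_usd = 50   # assumed spend on cheap Chinese-model APIs
monthly_energy_usd = 10      # assumed cost of powering the Mac yourself

# Months until owning the Mac beats just paying for the API.
breakeven_months = mac_cost_usd / (monthly_api_spend_usd - monthly_energy_usd)
print(f"breakeven: {breakeven_months:.0f} months "
      f"(~{breakeven_months / 12:.0f} years)")
# -> breakeven: 250 months (~21 years)
```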
It's my layman understanding that this would have to be fixed in the model weights themselves?
It can also be a bug in the model weights, with the model simply failing to generate the appropriate "I'm done thinking" indicator.
You can see this described in this PR https://github.com/ggml-org/llama.cpp/pull/19635
Apparently Step 3.5 Flash uses an odd format for its tags, so llama.cpp just doesn't handle it correctly.
It is a bug in the model weights and reproducible in their official chat UI. More details here: https://github.com/ggml-org/llama.cpp/pull/19283#issuecommen...
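For context on what "failing to generate the indicator" looks like mechanically: the server splits the output on a closing reasoning tag, and if the model never emits one (or emits an unexpected variant, as with Step 3.5 Flash), everything stays bucketed as thinking. A minimal sketch, assuming the common `<think>...</think>` convention rather than whatever tags this model actually uses:

```python
def split_reasoning(text: str,
                    open_tag: str = "<think>",
                    close_tag: str = "</think>") -> tuple[str, str]:
    """Separate chain-of-thought from the final answer.

    If the model never emits close_tag (or emits a tag the parser
    does not recognize), everything stays classified as reasoning
    and the client sees an empty response.
    """
    if open_tag in text and close_tag in text:
        start = text.index(open_tag) + len(open_tag)
        end = text.index(close_tag)
        return text[start:end], text[end + len(close_tag):]
    # No recognizable close tag: the whole output is "still thinking".
    return text, ""

reasoning, answer = split_reasoning("<think>hmm, the user wants...")
print(repr(answer))  # -> '' — looks like an endless thinking loop
```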
- OpenAI completions endpoint
- Anthropic messages endpoint
- OpenAI responses endpoint
- A slick-looking web UI
Without having to install anything else.
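For example, assuming this is llama-server, the completions endpoint speaks the standard OpenAI dialect, so the stock `openai` Python client works against it (default port 8080; the model name is just whatever GGUF you loaded):

```python
from openai import OpenAI

# llama-server listens on port 8080 by default; the api_key is ignored
# by the server but the client library requires one.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="whatever-you-loaded",  # llama-server serves the loaded model
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(resp.choices[0].message.content)
```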
For example, when I tried gpt-oss 120b with codex, it would very easily forget something present in the system prompt: "use `rg` command to search and list files".
I feel like gpt-oss has a lot of potential for agentic coding, but it needs to be constantly reminded of what is happening. Maybe a custom harness developed specifically for gpt-oss could make both gpt-oss models viable for long agentic coding sessions.
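The dumbest possible version of that harness idea is re-injecting the standing instructions every turn instead of trusting the model to keep attending to the system prompt. A sketch of the shape of the workaround (not how codex actually works; all names here are made up):

```python
# Hypothetical "keep reminding it" harness helper.
STANDING_ORDERS = "Reminder: use the `rg` command to search and list files."

def build_messages(system_prompt: str, history: list[dict]) -> list[dict]:
    """Rebuild the context each turn, re-appending the standing orders
    at the end, where the model is most likely to attend to them."""
    return (
        [{"role": "system", "content": system_prompt}]
        + history
        + [{"role": "user", "content": STANDING_ORDERS}]
    )
```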