Was the choice of such a small model driven by a desire for high tok/sec? I ask because an M4 Pro machine with 48GB can run larger models (if model intelligence is what would make it more useful).
reply
Yes, that was my goal. I also noticed a huge performance gain going from Ollama to MLX. Your mileage may vary.
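
If anyone wants to sanity-check that gain themselves, here's a minimal sketch using mlx-lm (the model repo is just an example from mlx-community, not necessarily what I ran):

    # Rough tok/sec check, assuming mlx-lm is installed (pip install mlx-lm).
    # Model repo is an example; swap in whatever you actually run.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Qwen3-4B-4bit")

    # verbose=True prints prompt/generation tokens-per-second, which you
    # can line up against the eval rates from `ollama run <model> --verbose`.
    generate(
        model,
        tokenizer,
        prompt="Write a short function that reverses a linked list.",
        max_tokens=256,
        verbose=True,
    )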
reply
I'm using the 30B MoE model on the same spec with a 65k-token context as a sub-agent with tooling, and it absolutely writes decent code. The dense 9B, I agree, wasn't great.
reply
Thanks for saying this. There's so much nonsense out there online about local models being better than Opus 4.7 and the like. It's just not true for regular users.

I have a brand-new M5 MacBook Pro, top end with all the specs, and the local models I've tried are barely functional.

reply
What models and quantizations have you been trying? I've had great success with the larger Qwen 3.x models at 6-bit. 6-bit quantization is really the bare minimum to give local models a fair shot at agentic flows; push below that and the models get noticeably dumber from the limited bit space.
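
For anyone who wants to try 6-bit, mlx-lm can quantize the weights itself; a minimal sketch (the Hugging Face source repo is just an example, grab whichever model you want):

    # Minimal sketch: produce 6-bit MLX weights with mlx-lm's convert
    # (pip install mlx-lm). Source repo is an example; pick your own.
    from mlx_lm import convert

    convert(
        "Qwen/Qwen3-30B-A3B",       # source weights on Hugging Face
        mlx_path="qwen3-30b-6bit",  # where the quantized model lands
        quantize=True,
        q_bits=6,                   # the floor I'd use for agentic flows
    )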
reply
The main benefits of local are:

1) Control
2) Privacy
3) A transparent cost model

Cloud has tremendous value for speed, plug-and-play convenience, and performance. You have to decide how those weigh against the benefits of local, both today and, say, a year from now.

reply
How does it (the OpenRouter version) compare to ChatGPT 5.5 or Claude Opus 4.6?
reply
Good enough. It gets 60-70% of the work I need done for a lot less $ (keep in mind I'm using these for personal projects that don't generate revenue). If I were using it with hopes of making money, I think I'd just use Codex at this point.
reply