The tricky part is that the "number of tokens to a good result" does absolutely vary, and you need a decent harness to make it work without too much manual intervention. Figuring out which model is most cost-effective for which tasks is becoming increasingly hard, but several are cost-effective enough.
Codex is just so much better, or the general GPT models.
https://github.blog/news-insights/company-news/changes-to-gi...
Substantially worse at following instructions and overoptimized for maximizing token usage
I do some stuff with gemini flash and Aider, but mostly because I want to avoid locking myself into a walled garden of models, UIs, and companies.
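For anyone curious, the pairing above is easy to try: Aider selects its backend via the `--model` flag (it routes through litellm-style provider/model names). The exact model string below is an assumption — check `aider --list-models gemini` for what your version supports.

```shell
# Point Aider at a Gemini Flash model (model name is illustrative;
# verify with `aider --list-models gemini` for your installed version).
export GEMINI_API_KEY=your-key-here
aider --model gemini/gemini-2.0-flash
```

The same `--model` switch is what makes it easy to swap providers later, which is the whole anti-lock-in point.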
If you're feeling frisky, Zed has a decent agent harness and a very good editor.
Opencode was getting there, but it seems the founders lost interest. Pi could be it, but it's very focused on OpenClaw. Even Codex CLI doesn't have all of it.
Which harness works well with Deepseek v4?
So while I agree mixed-model is the way to go, Opus is still my workhorse.
Not saying it is better or worse, but the way I personally prefer is to design in chat, to make sure all unknown unknowns are addressed.
In contrast, ChatGPT 5.3 and Opus both have at least a 90% rate on this same project. (Embedded)
All other tests were the same. What are you doing with these models?