For a while I used Cerebras Code for 50 USD a month, with them running a GLM model and giving you millions of tokens per day. It did a lot of heavy lifting in a software migration I was doing at the time (and made it DOABLE in the first place), BUT there were about 10 different places where the migration got fucked up and had to be fixed manually - files left over after refactoring (and, what's worse, basically duplicated ones), some constants and routes that were dead code, some development pages that weren't removed when they were superseded by others, and so on.
I would say that Claude Code, throwing Opus at most problems (and having it use Sonnet or Haiku sub-agents for simple and well-specified tasks), is actually way better, simply because it fucks things up less often and review iterations at least catch when things are going wrong like that. Worse models (and pretty much every one that I can afford to run locally, even ones that need ~80 GB of VRAM in the context of an org wanting to self-host stuff) will be confidently wrong and place time bombs in your codebase that you won't even be aware of if you don't pay close attention to everything - even when the task was rote bullshit that any model worth its salt should have resolved with 0 issues.
My fear is that models that would let me truly be as productive as I want with any degree of confidence might be Mythos tier and the economics of that just wouldn't work out.
For handing work off to an LLM in large chunks, picking the best model available is the only way to go right now.
I’m curious how to even do it. I have no idea how to choose which model to use in advance of a given task, regardless of the mental overhead.
And unless you can predict perfectly what you need, there’s going to be some overuse due to choosing the wrong model and having to redo some work with a better model, I assume?
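To put rough numbers on that "redo" overhead, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the per-task prices are made up, and it optimistically assumes every failure of the cheaper model gets noticed, which is exactly what the silent "time bomb" failures mentioned upthread would break.

```python
def expected_cost_cheap_first(cheap_cost: float, strong_cost: float,
                              p_cheap_ok: float) -> float:
    """Expected spend per task when a cheap model goes first and every
    (detected) failure is redone from scratch by the strong model."""
    return cheap_cost + (1.0 - p_cheap_ok) * strong_cost


def break_even_success_rate(cheap_cost: float, strong_cost: float) -> float:
    """Cheap-first only beats always-strong if the cheap model succeeds
    more often than this fraction of the time."""
    return cheap_cost / strong_cost


if __name__ == "__main__":
    # Hypothetical per-task prices, not real pricing for any model.
    CHEAP, STRONG = 1.0, 5.0
    print("break-even success rate:", break_even_success_rate(CHEAP, STRONG))
    for p in (0.1, 0.5, 0.9):
        cost = expected_cost_cheap_first(CHEAP, STRONG, p)
        print(f"p_cheap_ok={p}: cheap-first={cost:.2f} vs always-strong={STRONG:.2f}")
```

So under these assumptions the overuse is bounded: the worst case is paying for both models on a task. The part this leaves out is the cost of failures nobody catches, and that is the part that is hard to predict in advance.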
Even EMs and TPMs are assigning people based on their previous experience, which generally boils down to "I've seen this task before and I know what's involved," "this task is small, and I know what's involved," or "this task is too big and needs to be understood better."
That's how things worked pre-AI, and old problems are new problems again.
When you run any bigger project, you have senior folks who tackle the hardest parts of it, experienced folks who can churn out massive amounts of code, junior folks who target smaller/simpler/better-scoped problems, etc.
We don't default to telling the most senior engineer "you solve all of those problems". But they're often involved in evaluating/scoping down/breaking down the problem/supervising/correcting/etc.
There are tons of analogies and decades of industry experience to apply here.
I'm not saying that can't be done, but taking on a large task that hasn't been broken down needs, you guessed it, a powerful agent. that's your senior engineer who can figure out the rote parts, the medium parts, and the thorny parts.
the goal isn't to have an engineer do that. we should still be throwing powerful agents at a problem, they should just be delegating the work more efficiently.
throwing either an engineer or an agent at any unexplored work means you just have to delegate it to the most experienced resource, or suffer the consequences.
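A minimal sketch of what that delegation could look like, assuming the powerful agent has already done the breakdown and labeled each piece. The difficulty labels, tier names, and example subtasks are all hypothetical, not any real tool's API; the one rule taken from the comment above is that unexplored work never goes to a weaker tier.

```python
from dataclasses import dataclass
from enum import Enum


class Difficulty(Enum):
    ROTE = "rote"
    MEDIUM = "medium"
    THORNY = "thorny"
    UNEXPLORED = "unexplored"  # not broken down or understood yet


@dataclass(frozen=True)
class Subtask:
    description: str
    difficulty: Difficulty


# Hypothetical tier labels. Unexplored work is pinned to the strongest tier.
TIER_FOR = {
    Difficulty.ROTE: "small",
    Difficulty.MEDIUM: "mid",
    Difficulty.THORNY: "strongest",
    Difficulty.UNEXPLORED: "strongest",
}


def route(subtasks: list[Subtask]) -> dict[str, list[str]]:
    """Group subtask descriptions by the tier that should handle them."""
    plan: dict[str, list[str]] = {}
    for task in subtasks:
        plan.setdefault(TIER_FOR[task.difficulty], []).append(task.description)
    return plan


if __name__ == "__main__":
    # Made-up subtasks, as the strongest agent might label them after scoping.
    plan = route([
        Subtask("update import paths after a module move", Difficulty.ROTE),
        Subtask("port the settings page to the new router", Difficulty.MEDIUM),
        Subtask("untangle the legacy auth flow", Difficulty.THORNY),
        Subtask("figure out what the old cron jobs even do", Difficulty.UNEXPLORED),
    ])
    for tier, tasks in plan.items():
        print(tier, "->", tasks)
```

The labeling step is the expensive part and stays with the strongest agent; only the already-scoped pieces get handed down.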