undefined

points

[-]

I have a project where we've had Opus, Sonnet, Deepseek, Kimi, Qwen create and execute an aggregate total of about 350 plans so far, and the quality difference as measured in plans where the agent failed to complete the tasks on the first run is high enough that it comes out several times higher than Anthropics subscription prices, but probably cheaper than the API prices once we have improved the harness further - at present the challenge is that too much human intervention for the cheaper models drives up the cost.

My dashboard goes from all green to 50/50 green/red for our agents whenever I switch from Claude to one of the cheaper agents... This is after investing a substantial amount of effort in "dumbing down" the prompts - e.g. adding a lot of extra wording to convince the dumber models to actually follow instructions - that is not necessary for Sonnet or Opus.

I buy the benchmarks. The problem is that a 10% difference in the benchmarks makes the difference between barely usable and something that can consistently deliver working code unilaterally and require few review interventions. Basically, the starting point for "usable" on these benchmarks is already very far up the scale for a lot of tasks.

I do strongly believe the moat is narrow - With 4.6 I switched from defaulting to Opus to defaulting to Sonnet for most tasks. I can fully see myself moving substantial workloads to a future iteration of Kimi, Qwen or Deepseek in 6-12 months once they actually start approaching Sonnet 4.5 level. But for my use at least, currently, they're at best competing with Athropics 3.x models in terms of real-world ability.

That said, even now, I think if we were stuck with current models for 12 months, we might well also be able to build our way around this and get to a point where Deepseek and Kimi would be cheaper than Sonnet.

Eventually we'll converge on good enough harnesses to get away with cheaper models for most uses, and the remaining appeal for the frontier models will be complex planning and actual hard work.

by oren15311 hours ago|

parent|

[-]

Good point on the green/red dashboard. The opportunity cost angle is worth adding though. A failed run isn't just the wasted tokens and retry cost - it's also the task that didn't get done and the engineering required to diagnose why. On anything time-sensitive, that compounds fast.

by Bombthecat1 hours ago|

parent|

prev|

[-]

In 12 months, opus will be better than now and you still won't use it lol

by cesarvarela38 minutes ago|

prev|

[-]

The harness makes a difference too.