But the narrative that 4.Y is an improvement over 4.X is essential to keep the model training music playing.
If 90+% of the gains come from the harness, how can you continue to justify spending billions of dollars on training and an 80% gross margin on inference on the latest model? (Reportedly what Anthropic commands on the top tier of their frontier model API billing).
So differentiating between the two (what I’m trying to do here) is really consequential!
I was one of these people that Claude would never finish anything and just randomly say, this is a good stopping point, I think you should go to bed.
And then I'd tell it to continue, and it would burn tons of tokens, make no progress and say, "This is a really good stopping point..."
Canceled and switched to Codex and have been pretty happy with it. It doesn't plan as well as Claude, but I think it does better implementation - and neither of them can actually come up with good plans without a ton of help...
Codex is also way faster.
There's a sweet spot of complexity for low importance tasks where it's just big enough I don't want to do it and just simple enough to have opus plan/delegate/review with another model. So possibly model improvements will grow this window, but currently I don't do much in there.
Honestly... not that dramatically. Each release is much more marginal. And quoted official benchmarks doesn't translate very well into the real world.
4.7 regressed hard in some ways. But a compounding factor too is that the claude code harness seems to nerf the model after a few months. Probably to reduce token use.
So far 4.8 seems less verbose but we'll see in practice what it translates into meaningfully.