by gen220 21 hours ago

For long-running tasks, yes, 4.7 has been a noticeable improvement. It goes off the rails a lot less than 4.6 does. For shorter windows, I haven't felt it as much, and I agree that the harness improvements have been the biggest lever.

by csvance 19 hours ago

When doing big long-running workflows, especially with plan mode, 4.7 was a clear improvement. It’s considerably worse for underspecified tasks, and in explanatory discussions it responds to a couple of sentences with 10+ paragraphs.

by themgt 19 hours ago

Opus 4.7 + Max is a 10x engineer who wants to be left alone to work. When you talk to him, he infodumps on you to get you (his pointy-haired idiot Dilbert boss) to go away.

by 4gotunameagain 9 hours ago

OR they deliberately increased token usage to inflate pre-IPO numbers.

by fittingopposite 9 hours ago

Yes. You and some random indigenous guy in the Amazon likely share the same intelligence but you are more capable because you have access to writing/reading, computer, car etc. Intelligence is more than raw intelligence. It's harness, skills, tools, memory etc. If you improve all the latter but keep the raw intelligence (LLM) fixed, you certainly get better results. Same with us humans.

by gen220 2 hours ago

Of course. I’m not trying to dismiss gains from the harness; actually, the opposite.

But the narrative that 4.Y is an improvement over 4.X is essential to keep the model training music playing.

If 90+% of the gains come from the harness, how can you continue to justify spending billions of dollars on training and an 80% gross margin on inference on the latest model? (Reportedly what Anthropic commands on the top tier of their frontier model API billing).

So differentiating between the two (what I’m trying to do here) is really consequential!

by computably 9 hours ago

Except LLMs are simulacra of actual intelligence. Frequently in a single conversation working on a single narrowly scoped task, I am both surprised by a few insights and cursing at how it can miss obvious issues. The "raw intelligence" of LLMs leaves much to be desired.

by bonoboTP 21 hours ago

To me 4.5 was mind-blowing, 4.6 noticeable, and 4.7 more like a style/personality change (how much it asks back, how much it assumes, how eager it is to jump to action, etc.) than a change in my perception of its smartness.

by onlyrealcuzzo 13 hours ago

In my experience, 4.7 was a noticeable step down from 4.6.

I was one of those people for whom Claude would never finish anything and would just randomly say, "This is a good stopping point, I think you should go to bed."

And then I'd tell it to continue, and it would burn tons of tokens, make no progress and say, "This is a really good stopping point..."

I canceled and switched to Codex and have been pretty happy with it. It doesn't plan as well as Claude, but I think it does better implementation. And neither of them can actually come up with good plans without a ton of help...

Codex is also way faster.

by somenameforme 21 hours ago

They all feel, more or less, the same to me in terms of output capabilities. They mostly get simple things right, can get more complex things right with nudging, and eventually get stuck hard on something that takes either a bunch of iterations (going back through it, logging, etc.) or me fixing the code manually.

by bcrosby95 20 hours ago

4.6 felt a bit better than 4.5 but slower. 4.7 doesn't feel better than 4.6.

by giraffe_lady 21 hours ago

I actually don't see any personal productivity improvements from using Opus over Sonnet for coding. If you're keeping tasks small and conversations short, reading the code and correcting before changes go in, whatever advantages Opus has aren't practically significant. It's also just talky as hell: it overexplains anything it touches, and every token produced this way increases the surface area for hallucination, so you need to have your guard up even more with it.

There's a sweet spot of complexity for low-importance tasks where it's just big enough that I don't want to do it and just simple enough to have Opus plan/delegate/review with another model. So possibly model improvements will grow this window, but currently I don't do much in there.

by alfalfasprout 18 hours ago

I'm actually currently studying this :)

Honestly... not that dramatically. Each release is much more marginal. And the officially quoted benchmarks don't translate very well into the real world.

4.7 regressed hard in some ways. But a compounding factor is that the Claude Code harness seems to nerf the model after a few months, probably to reduce token use.

So far 4.8 seems less verbose, but we'll see what that meaningfully translates into in practice.