I don't think either harness does enough to encourage the model to challenge all assumptions and ask questions, maybe because users might find it annoying. That step is basically a requirement IMO.
I've found all of the GPT-5 models to be very nit-picky, which is useful for code review and mathematics (important for my work) but seems to get in the way of "aesthetic" code, e.g. they write overly defensive code to cover every edge case, even unlikely ones.
There also seems to be a tradeoff between flexibility and instruction following. In my experience Opus will sometimes ignore instructions but can "fill in the blanks" more, whereas GPT-5.5 follows instructions better, perhaps at the cost of rigidity.
Because you'd not want to forever loop outside your home when asked to "while you're out, grab some eggs" :)
Because the entire reason we use LLMs is to supposedly improve productivity?
Specifying the problem is not extra work separate from solving it. If you skip that step, the ambiguity gets pushed into the model’s assumptions. Then you get a plausible-looking answer to the wrong problem and have to waste time backing out of it.
LLMs are not magic machines that can read your mind.
In my own work, it's usually been a few critical assumptions the model made silently (and that I never even thought of initially) that end up being the difference between passable results on the first try and me having to go back and fix things. Occasionally some questions force me to rethink the problem entirely.
I basically always begin any long-running session with this kind of brainstorming. I don't find the existing plan modes in Claude Code/Codex to be critical enough.
Minimizing effort is the obvious answer.
I think the reason Claude has so much mindshare is exactly because it’s more useful to non-developers who wouldn’t know how to describe what an API call does to their grandmother.
For those who can, I can’t find much of a difference between them. Codex has the slight edge, but that’s all just “feels” to me.
> I think the reason Claude has so much mindshare is exactly because it’s more useful to non-developers who wouldn’t know how to describe what an API call does to their grandmother.
This is exactly the benefit for most people.
Most people don't want to code the app, they just want the app.
Even people like us who do like coding, we can only think of all of these things within a domain that we already know; somebody who writes shaders for games isn't likely to know or care much about the ins and outs of database development or how healthcare privacy law and KYC interact with zero-knowledge proofs.
(Of course, if the AI knows about these things and then completely fails to make use of that knowledge, that's still a fail.)
It's not my experience that Opus is leagues ahead or even superior, but in any case, since GPT 5.5 has Instant, Medium, High, Extra High and Pro, should the comparison be with GPT on Pro instead of Extra High, as seems to be the case in the table?
There are specific tasks that Opus does better on, like frontend dev and design, but for anything else 5.5 just laps it.
You guys are all a lost cause.
4.8 also requires more than one prompt, but its output is significantly higher quality and offers more insight.
Fable 5 is a different beast however.
At a high level. It misses low level or other non-functional requirements differently so I wouldn't say Opus is just strictly better.
It's also possible that it's just a harness problem more than model.
Tool expectations