
by _345 14 hours ago

by nsingh2 13 hours ago

Why supply underspecified requirements in the first place? Both models are good at challenging assumptions/edge cases and asking questions to clarify, but seemingly only when explicitly asked (i.e. something like a "brainstorm" skill).

I don't think either harness does enough to encourage the model to challenge all assumptions and ask questions, maybe because users might find it annoying. That step is basically a requirement IMO.

I've found all of the GPT-5 models to be very nit-picky, which is useful for code review and mathematics (important for my work), but it seemingly gets in the way of "aesthetic" code, e.g. overly defensive code to cover all edge cases, even unlikely ones.

There also seems to be a tradeoff between flexibility and instruction following. In my experience Opus will sometimes ignore instructions but can "fill in the blanks" more, whereas GPT-5.5 follows instructions better but perhaps at the cost of rigidity.

by fooker 13 hours ago

> Why supply underspecified requirements in the first place?

Because you'd not want to forever loop outside your home when asked to "while you're out, grab some eggs" :)

by reactordev 9 hours ago

Meaning why not leave home with your grocery list?

by iLoveOncall 11 hours ago

> Why supply underspecified requirements in the first place?

Because the entire reason we use LLMs is to supposedly improve productivity?

by nsingh2 10 hours ago

Refusing to sufficiently specify a task and hoping the model guesses correctly is not being productive. Again, these models still don't really ask questions when they should. You have to explicitly tell them to.

Specifying the problem is not extra work separate from solving it. If you skip that step, the ambiguity gets pushed into the model’s assumptions. Then you get a plausible looking answer to the wrong problem and have to waste time backing out of it.

LLMs are not magic machines that can read your mind.

by iLoveOncall 10 hours ago

My point is that it is much faster for me to solve the problem by writing the code than to write specifications detailed enough for the model to do the right thing in the right way.

by nsingh2 10 hours ago

A highly detailed specification is not what I mean here. It's closer to plugging in a few sentences of description (or a totally cluttered brain dump) and having the model interview you to help pin down critical details before continuing.

In my own work, it's usually been a few critical assumptions the model made silently (and that I never even thought of initially) that end up being the difference between passable results on the first try and me having to go back and fix things. Occasionally some questions force me to rethink the problem entirely.

I basically always begin any long-running session with this kind of brainstorming. I don't find the existing plan modes in Claude Code/Codex to be critical enough.

by reactordev 9 hours ago

You should try transcribing while you speak. Then you can explain and articulate the task sufficiently that the model should have enough context to complete the task to your satisfaction. Since you won’t write it.

by mejutoco 7 hours ago

This assumes someone not articulate in writing will be articulate in talking. The most likely outcome is there will be more text with the same information. One can do a little interpretative dance as well but the clearer the requirements the better the result.

by iLoveOncall 8 hours ago

My colleagues will thank me for speaking non-stop right next to them surely.

by antonvs 13 hours ago

> Why supply underspecified requirements in the first place?

Minimizes effort, is the obvious answer.

by cyberpunk 12 hours ago

Poor trade off, the model is then designing a massive chunk of your solution instead of you. With a good spec, bits of typo’d pseudocode, and slightly more effort than a couple of sentences they can actually produce passable software.

I think the reason Claude has so much mindshare is exactly because it’s more useful to non-developers who wouldn’t know how to describe what an API call executes to their grandmother.

For those who can, I can’t find much of a difference between them. Codex has the slight edge, but that’s all just “feels” to me.

by ben_w 12 hours ago

You call it a poor trade off, but:

> I think the reason Claude has so much mindshare is exactly because it’s more useful to non-developers who wouldn’t know how to describe what an API call executes to their grandmother.

This is exactly the benefit for most people.

Most people don't want to code the app, they just want the app.

Even people like us who do like coding, we can only think of all of these things within a domain that we already know; somebody who writes shaders for games isn't likely to know or care much about the ins and outs of database development or how healthcare privacy law and KYC interact with zero-knowledge proofs.

(Of course, if the AI knows about these things and then completely fails to make use of that knowledge, that's still a fail.)

by root-parent 8 hours ago

The best benchmarks are the ones you create yourself.

It's not my experience that Opus is leagues ahead or even superior. In any case, since GPT 5.5 has Instant, Medium, High, Extra High and Pro, should the comparison be with GPT on Pro instead of Extra High, as seems to be the case in the table?

by d4rkp4ttern 8 hours ago

I didn’t know you could get the “Chat-GPT-5.5 Pro” (the one that’s been solving Erdos problems) inside codex-cli, or maybe I misunderstood?

by Terretta 8 hours ago

And, in turn, Opus with ultracode?

by CSMastermind 13 hours ago

Man I don't know if I'm living in a crazy bubble or something but GPT 5.5 is lightyears better than Opus 4.8 for me to the point where I'm honestly wondering how you're evaluating them or what kind of work you're doing.

There are specific tasks that Opus does better on, like frontend dev and design, but for anything else 5.5 just laps it.

by dools 12 hours ago

Yeah, I’ve been consistently underwhelmed by Anthropic models, but then I don’t use their harness, so maybe that’s it.

by wwind123 11 hours ago

In my experience, for more mechanical refactoring work (like splitting a big source code file into multiple smaller ones), GPT 5.5 runs way faster than any of the Claude models. But for other tasks that require deeper reasoning, it's not that clear who is the winner.

by iLoveOncall 11 hours ago

It's just too funny to see people arguing about "no, it's my religion that's the right one!" on Hacker News.

You guys are all a lost cause.

by goosejuice 8 hours ago

How is attempting to benchmark LLMs like religion?

by iLoveOncall 8 hours ago

Re-read the comment I'm replying to, it's not talking about benchmarks, just models.

by m3kw9 3 hours ago

Better for vibe coders who always underspecify. But how does it know when you are underspecifying and when you have specified properly, so that it doesn't override your specification?

by zuzululu 13 hours ago

Same observation here. Opus 4.8 was significantly more mature (and I don't understand the people constantly defending GPT 5.5); it would even push back against anything off-putting, whereas GPT 5.5 will happily agree and do what is asked, though I would note that it takes several tries.

4.8 also requires more than one prompt, but its output is significantly higher quality and offers more insight.

Fable 5 is a different beast however.

by re-thc 14 hours ago

> It's so good at taking underspecified requirements and filling in the gaps with sensible approaches for your project.

At a high level. It misses low-level or other non-functional requirements differently, so I wouldn't say Opus is just strictly better.

It's also possible that it's more of a harness problem than a model problem.

by e913 hours ago

I agree with you on the harness. I find that Claude can be good in any harness but GPT is only superior inside Codex.

by hypfer 12 hours ago

Similarly, it explains to me why people found Claude so amazing, while I just thought "eh."

Tool expectations
