Totally. Best-in-class for SWE work (until Mythos gets released, if ever, but I suspect the rumored "Spud" will be out by then too)

by girvo1 hours ago|

It really isn’t. I wish it was, because work complains about overuse of Opus.

by rafaelmn3 hours ago|

GPT is shit at writing code. It's not dumb - extra high thinking is really good at catching stuff - but it's like letting a smart junior into your codebase: it ignores all the conventions and surrounding context and just slops all over the place to get things working. Claude is just a level above in terms of editing code.

by sho_hn2 hours ago|

Very different experience for me. Codex 5.3+ on xhigh are the only models I've tried so far that write reasonably decent C++ (domains: desktop GUI, robotics, game engine dev, embedded stuff, general systems engineering-type codebases), and idiomatic code in languages not well-represented in training data, e.g. QML. One thing I like explicitly is that it knows better when to stop, instead of brute-forcing a solution by spamming bespoke helpers everywhere in a way no rational dev would write.

Not always, no, and it takes investment in good prompting/guardrails/plans/explicit test recipes for sure. I'm still on average better at programming in context than Codex 5.4, even if slower. But in terms of "task complexity I can entrust to a model and not be completely disappointed and annoyed", it scores the best so far. Saves a lot on review/iteration overhead.

It's annoying, too, because I don't much like OpenAI as a company.

(Background: 25 years of C++ etc.)

by boring-human1 hours ago|

Same background as you, and same exact experience as you. Opus and Gemini have not come close to Codex for C++ work. I also run exclusively on xhigh. Its handling of complexity is unmatched.

At least until next week when Mythos and GPT 6 throw it all up in the air again.

by Jcampuzano23 hours ago|

Not my experience. GPT 5.4 walks all over Claude from what I've worked with, and it's Claude that is the one willing to just go do unnecessary stuff that was never asked for, or implement the more hacky solutions to things without a care for maintainability/readability.

But I do not use extra high thinking unless it's for code review. I sit at GPT 5.4 high 95% of the time.

by camdenreslink15 minutes ago|

ChatGPT 5.4 with extra high reasoning has worked really well for me, and I don't notice a huge difference with Opus 4.6 with high reasoning (those are the 2 models/thinking modes I've used the most in the last month or so).

by leobuskin3 hours ago|

And as a bonus: GPT is slow. I’m doing a lot of RE (IDA Pro + MCP), and even when 5.4 gives slightly better guesses (rarely, but it happens), it takes 2x-4x longer. So it’s just easier to iterate again with Opus.

by blazespin1 hours ago|

Yeah, need some good RE benchmarks for the LLMs. :)

RE is a very interesting problem. A lot more than SWE can be RE'd. I've found the LLMs are reluctant to assist, though you can work around that.

by porker1 hours ago|

What is RE in this context?

by astrange1 hours ago|

Reverse engineering

by zarzavat3 hours ago|

Yes, it's becoming clear that OpenAI kinda sucks at alignment. GPT-5 can pass all the benchmarks but it just doesn't "feel good" like Claude or Gemini.

by lilytweed3 hours ago|

Whenever I come back to ChatGPT after using Claude or Gemini for an extended period, I’m really struck by the “AI-ness.” All the verbal tics and, truly, sloppishness have been trained away in the other, more human-feeling models at this point.

by kranke15522 minutes ago|

GPT was clearly changed after its sycophantic models led to the lawsuits.

by chaos_emergent2 hours ago|

An alternative but similar formulation of that statement is that Anthropic has spent more training effort in getting the model to “feel good” rather than being correct on verifiable tasks. Which more or less tracks with my experience of using the model.

by whalesalad3 hours ago|

This has been my experience. With very, very rigid constraints it does OK, but without them it will optimize for expediency and getting it done at the expense of integrating with the broader system.

by ctoth2 hours ago|

My favorite example of this from last night:

Me: Let's figure out how to clone our company WordPress theme in Hugo. Here're some tools you can use, here's a way to compare screenshots, iterate until 0% difference.

Codex: Okay Boss! I did the thing! I couldn't get the CSS to match so I just took PNGs of the original site and put them in place! Matches 100%!