undefined

points

by granzymes6 hours ago |

comments

by the_duke6 hours ago|

[-]

I do not trust the AI benchmarks much, they often do not line up with my experience.

That said ... I do think Codex 5.2 was the best coding model for more complex tasks, albeit quite slow.

So very much looking forward to trying out 5.3.

by NitpickLawyer6 hours ago|

parent|

[-]

Just some anecdata++ here but I found 5.2 to be really good at code review. So I can have something crunched by cheaper models, reviewed async by codex and then re-prompt with the findings from the review. It finds good things, doesn't flag nits (if prompted not to) and the overall flow is worth it for me. Speed loss doesn't impact this flow that much.

by kilroy1235 hours ago|

parent|

[-]

Personally, I have Claude do the coding. Then 5.2-high do the reviewing.

by mmaunder1 hours ago|

parent|

[-]

I might flip that given how hard it's been for Claude to deal with longer context tasks like a coding session with iterations vs a single top down diff review.

by seunosewa5 hours ago|

parent|

prev|

[-]

Then I pass the review back to Claude Opus to implement it.

by VladVladikoff5 hours ago|

parent|

[-]

Just curious is this a manual process or you guys have automated these steps?

by ricketycricket4 hours ago|

parent|

[-]

I have a `codex-review` skill with a shell script that uses the Codex CLI with a prompt. It tells Claude to use Codex as a review partner and to push back if it disagrees. They will go through 3 or 4 back-and-forth iterations some times before they find consensus. It's not perfect, but it does help because Claude will point out the things Codex found and give it credit.

by bryanlarsen2 hours ago|

parent|

[-]

Mind sharing the skill/prompt?

by dror18 minutes ago|

parent|

[-]

Not the OP, but I use the same approach.

https://gist.github.com/drorm/7851e6ee84a263c8bad743b037fb7a...

I typically use github issues as the unit of work, so that's part of my instruction.

by _zoltan_3 hours ago|

parent|

prev|

[-]

zen-mcp (now called pal-mcp I think) and then claude code can actually just pass things to gemini (or any other model)

by kilroy1232 hours ago|

parent|

prev|

[-]

Sometimes, depends on how big of a task. I just find 5.2 so slow.

by _zoltan_3 hours ago|

parent|

prev|

[-]

I have Opus 4.5 do everything then review it with Gemini 3.

by StephenHerlihyy5 hours ago|

parent|

prev|

[-]

I don’t use OpenAI too much, but I follow a similar work flow. Use Opus for design/architecture work. Move it to Sonnet for implementation and build out. Then finally over to Gemini for review, QC and standards check. There is an absolute gain in using different models. Each has their own style and way of solving the problem just like a human team. It’s kind of awesome and crazy and a bit scary all at once.

by readyforbrunch3 hours ago|

parent|

[-]

How do you orchestrate this workflow? Do you define different skills that all use different models, or something else?

by aurareturn6 hours ago|

parent|

prev|

[-]

5.2 Codex became my default coding model. It “feels” smarter than Opus 4.5.

I use 5.2 Codex for the entire task, then ask Opus 4.5 at the end to double check the work. It's nice to have another frontier model's opinion and ask it to spot any potential issues.

Looking forward to trying 5.3.

by koakuma-chan6 hours ago|

parent|

[-]

Opus 4.5 is more creative and better at making UIs

by fooker6 hours ago|

parent|

prev|

[-]

Yeah, these benchmarks are bogus.

Every new model overfits to the latest overhyped benchmark.

Someone should take this to a logical extreme and train a tiny model that scores better on a specific benchmark.

by bunderbunder4 hours ago|

parent|

[-]

All shared machine learning benchmarks are a little bit bogus, for a really “machine learning 101” reason: your test set only yields an unbiased performance metric if you agree to only use it once. But that just isn’t a realistic way to use a shared benchmark. Using them repeatedly is kind of the whole point.

But even an imperfect yardstick is better than no yardstick at all. You’ve just got to remember to maintain a healthy level of skepticism is all.

by abustamam3 hours ago|

parent|

[-]

Is an imperfect yardstick better than no yardstick? It reminds me of documentation — the only thing worse than no documentation is wrong documentation.

by mrandish5 hours ago|

parent|

prev|

[-]

> Yeah, these benchmarks are bogus.

It's not just over-fitting to leading benchmarks, there's also too many degrees of freedom in how a model is tested (harness, etc). Until there's standardized documentation enabling independent replication, it's all just benchmarketing .

by fooker5 hours ago|

parent|

[-]

For the current state of AI, the harness is unfortunately part of the secret sauce.

by scoring17744 hours ago|

parent|

prev|

[-]

This has been done: https://arxiv.org/abs/2510.04871v1

by mmaunder1 hours ago|

parent|

prev|

[-]

ARG-AGI-2 leaderboard has a strong correlation with my Rust/CUDA coding experience with the models.

by nerdsniper5 hours ago|

parent|

prev|

[-]

Opus 4.5 still worked better for most of my work, which is generally "weird stuff". A lot of my programming involves concepts that are a bit brain-melting for LLMs, because multiple "99% of the time, assumption X is correct" are reversed for my project. I think Opus does better at not falling into those traps. Excited to try out 5.3

by nubg4 hours ago|

parent|

[-]

what do you do?

by audience_mem1 hours ago|

parent|

[-]

He works on brain-melting stuff, the understanding of which is far beyond us.

by jahsome6 hours ago|

parent|

prev|

[-]

Another day, another hn thread of "this model changes everything" followed immediately by a reply stating "actually I have the literal opposite experience and find competitor's model is the best" repeated until it's time to start the next day's thread.

by StephenHerlihyy5 hours ago|

parent|

[-]

What amazes me the most is the speed at which things are advancing. Go back a year or even a year before that and all these incremental improvements have compounded. Things that used to require real effort to consistently solve, either with RAGs, context/prompt engineering, have become… trivial. I totally agree with your point that each step along the way doesn’t necessarily change that much. But in the aggregate it’s sort of insane how fast everything is moving.

by Rudybega2 hours ago|

parent|

[-]

The denial of this overall trend on here and in other internet spaces is starting to really bother me. People need to have sober conversations about the speed of this increase and what kind of effects it's going to have on the world.

by SatvikBeri4 hours ago|

parent|

prev|

[-]

I use Claude Code every day, and I'm not certain I could tell the difference between Opus 4.5 and Opus 4.0 if you gave me a blind test

by clhodapp5 hours ago|

parent|

prev|

[-]

And of course the benchmarks are from the school of "It's better to have a bad metric than no metric", so there really isn't any way to falsify anyone's opinions...

by malshe6 hours ago|

parent|

prev|

[-]

This pretty accurately summarizes all the long discussions about AI models on HN.

by cactusplant73745 hours ago|

parent|

prev|

[-]

Hourly occurrence on /r/codex. Model astrology is about the vibes.

by wasmainiac6 hours ago|

parent|

prev|

[-]

[flagged]

by nocman5 hours ago|

parent|

[-]

> Who are making these claims? script kiddies? sr devs? Altman?

AI agents, perhaps? :-D

by locknitpicker5 hours ago|

parent|

prev|

[-]

> All anonymous as well. Who are making these claims? script kiddies? sr devs? Altman?

You can take off your tinfoil hat. The same models can perform differently depending on the programming language, frameworks and libraries employed, and even project. Also, context does matter, and a model's output greatly varies depending on your prompt history.

by andrepd4 hours ago|

parent|

[-]

It's hardly tinfoil to understand that companies riding a multi-trillion dollar funding wave would spend a few pennies astroturfing their shit on hn. Or overfit to benchmarks that people take as objective measurements.

by BoredPositron5 hours ago|

parent|

prev|

[-]

When you keep his ramblings on twitter or company blog in mind I bet he is a shit poster here.

by leumon5 hours ago|

prev|

[-]

they tested it at xhigh reasoning though, which is probably double the cost of Anthropic's model.

Cost to Run Artificial Analysis Intelligence Index:

GPT-5.2 Codex (xhigh): $3244

Claude Opus 4.5-reasoning: $1485

(and probably similar values for the newer models?)

by redox995 hours ago|

parent|

[-]

With $20 gpt plan you can use xhigh no problem. With $20 Claude plan you reach the 5h limit with a single feature.

by mattkevan4 hours ago|

parent|

[-]

Ha, Claude Code on a pro plan often can't complete a single message before hitting the 5h limit. Not hit it once so far on Codex.

by naths884 hours ago|

parent|

[-]

This, so frustrating. But CC is so much faster too.

by Computer04 hours ago|

parent|

prev|

[-]

A provider's API costs seemingly do not reflect each respective SOTA provider's subscription usage allowances.

by __jl__6 hours ago|

prev|

[-]

Impressive jump for GPT-5.3-codex and crazy to see two top coding models come out on the same day...

by granzymes6 hours ago|

parent|

[-]

Insane! I think this has to be the shortest-lived SOTA for any model so far. Competition is amazing.

by wilg5 hours ago|

prev|

[-]

In my personal experience the GPT models have always been significantly better than the Claude models for agentic coding, I’m baffled why people think Claude has the edge on programming.

by dudeinhawaii4 hours ago|

parent|

[-]

I think for many/most programmers = 'speed + output' and webdev == "great coding".

Not throwing shade anyone's way. I actually do prefer Claude for webdev (even if it does cringe things like generate custom CSS on every page) -- because I hate webdev and Claude designs are always better looking.

But the meat of my code is backend and "hard" and for that Codex is always better, not even a competition. In that domain, I want accuracy and not speed.

Solution, use both as needed!

by falloutx3 hours ago|

parent|

[-]

> I actually do prefer Claude for webdev

Ah and let me guess all your frontends look like cookie cutter versions of this: https://openclaw.dog/

by Yiin2 hours ago|

parent|

[-]

Yes and I love it.

by whynotminot4 hours ago|

parent|

prev|

[-]

> Solution, use both as needed!

This is the way. People are unfortunately starting to divide themselves into camps on this — it’s human nature we’re tribal - but we should try to avoid turning this into a Yankees Redsox.

Both companies are producing incredible models and I’m glad they have strengths because if you use them both where appropriate it means you have more coverage for important work.

by soulofmischief4 hours ago|

parent|

prev|

[-]

GPT 5.2 codex plans well but fucks off a lot, goes in circles (more than opus 4.5) and really just lacks the breadth of integrated knowledge that makes opus feel so powerful.

Opus is the first model I can trust to just do things, and do them right, at least small things. For larger/more complex things I have to keep either model on extremely short leashes. But the difference is enough that I canceled my GPT Pro sub so I could switch to Claude. Maybe 5.3 will change things, but I also cannot continue to ethically support Sam Altman's business.

by wilg2 hours ago|

parent|

[-]

I always use 5.2-Codex-High or 5.2-Codex-Extra High (in Cursor). The regular version is probably too dumb.

by soulofmischief1 hours ago|

parent|

[-]

Didn't make a difference for me. Though I will say, so far 4.6 is really pissing me off and I might downgrade back to 4.5. It just refuses to listen to what I say, the steering is awful.

by fragmede2 hours ago|

parent|

prev|

[-]

How many people are building the same thing multiple times to compare model performance? I'm much more interested in getting the thing I'm building getting built, than than comparing AIs to each other.

by jronak5 hours ago|

prev|

[-]

Did you look at the ARC AGI 2? Codex might be overfit for terminal bench

by tedsanders5 hours ago|

parent|

[-]

ARC AGI 2 has a training set that model providers can choose to train on, so really wouldn't recommend using it as a general measure of coding ability.

by 2 minutes ago|

parent|

[-]

deleted

by mrandish3 hours ago|

parent|

prev|

[-]

A key aspect of ARC AGI is to remain highly resistant to training on test problems which is essential for ARC AGI's purpose of evaluating fluid intelligence and adaptability in solving novel problems. They do release public test sets but hold back private sets. The whole idea is being a test where training on public test sets doesn't materially help.

The only valid ARC AGI results are from tests done by the ARC AGI non-profit using an unreleased private set. I believe lab-conducted ARC AGI tests must be on public sets and taken on a 'scout's honor' basis that the lab self-administered the test correctly, didn't cheat or accidentally have public ARC AGI test data slip into their training data. IIRC, some time ago there was an issue when OpenAI published ARC AGI 1 test results on a new model's release which the ARC AGI non-profit was unable to replicate on a private set some weeks later (to be fair, I don't know if these issues were resolved). Edit to Add: Summary of what happened: https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...

I have no expertise to verify how training-resistant ARC AGI is in practice but I've read a couple of their papers and was impressed by how deeply they're thinking through these challenges. They're clearly trying to be a unique test which evaluates aspects of 'human-like' intelligence other tests don't. It's also not a specific coding test and I don't know how directly ARC AGI scores map to coding ability.

by janalsncm4 hours ago|

parent|

prev|

[-]

More fundamentally, ARC is for abstract reasoning. Moving blocks around on a grid. While in theory there is some overlap with SWE tasks, what I really care about is competence on the specific task I will ask it to do. That requires a lot of domain knowledge.

As an analogy, Terence Tao may be one of the smartest people alive now, but IQ alone isn’t enough to do a job with no domain-specific training.

by nurettin6 hours ago|

prev|

[-]

Opus was quite useless today. Created lots of globals, statics, forward declarations, hidden implementations in cpp files with no testable interface, erasing types, casting void pointers, I had to fix quite a lot and decouple the entangled mess.

Hopefully performance will pick up after the rollout.

by nickstinemates4 hours ago|

parent|

[-]

Did you give it any architecture guidance? An architecture skill that it can load to make sure it lays out things according to your taste?

by nurettin2 hours ago|

parent|

[-]

Yes, it has a very tight CLAUDE.md which it used to follow. Feels like this happens a couple of times a month.