https://gist.github.com/drorm/7851e6ee84a263c8bad743b037fb7a...
I typically use GitHub issues as the unit of work, so that's part of my instruction.
I use 5.2 Codex for the entire task, then ask Opus 4.5 at the end to double-check the work. It's nice to have another frontier model's opinion and ask it to spot any potential issues.
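Roughly, the loop looks like this. A minimal sketch, assuming you call both models through their Python SDKs; the model names, the gh invocation, and the prompts are placeholders, not my exact setup:

```python
# Rough sketch of the issue -> implement -> cross-check loop, assuming API access.
import subprocess
from openai import OpenAI
from anthropic import Anthropic


def fetch_issue(number: int) -> str:
    # The GitHub issue is the unit of work; pull its title and body with the gh CLI.
    result = subprocess.run(
        ["gh", "issue", "view", str(number), "--json", "title,body"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def implement(issue_text: str) -> str:
    # First pass: the primary model does the entire task.
    resp = OpenAI().chat.completions.create(
        model="gpt-5.2-codex",  # placeholder model name
        messages=[{"role": "user", "content": f"Implement this issue:\n\n{issue_text}"}],
    )
    return resp.choices[0].message.content


def cross_check(issue_text: str, proposed_change: str) -> str:
    # Second pass: a different frontier model reviews the work for potential issues.
    resp = Anthropic().messages.create(
        model="claude-opus-4-5",  # placeholder model name
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": (
                f"Issue:\n{issue_text}\n\nProposed change:\n{proposed_change}\n\n"
                "Double-check this work and point out any potential issues."
            ),
        }],
    )
    return resp.content[0].text
```

In practice I drive this interactively rather than as a script, but the shape of the loop is the same.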
Looking forward to trying 5.3.
Every new model overfits to the latest overhyped benchmark.
Someone should take this to a logical extreme and train a tiny model that scores better on a specific benchmark.
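Taken all the way, that "tiny model" bottoms out as a lookup table over the test set. A toy sketch with made-up questions:

```python
# Toy sketch of benchmark overfitting taken to its logical extreme:
# the "model" is just a lookup table over the benchmark's (made-up) test questions.
BENCHMARK_ANSWERS = {
    "What is the capital of France?": "Paris",
    "2 + 2 = ?": "4",
}


def tiny_model(question: str) -> str:
    # Perfect score on this benchmark, zero generalization anywhere else.
    return BENCHMARK_ANSWERS.get(question, "I don't know")
```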
But even an imperfect yardstick is better than no yardstick at all. You’ve just got to remember to maintain a healthy level of skepticism.
When such benchmarks aren’t available, what you often get instead is teams creating their own benchmark datasets and then testing both their own and existing models’ performance against them. Which is even worse, because they probably still run the test multiple times (there’s simply no way to hold anyone accountable on this front), and on top of that they often hyperparameter-tune their own model for the dataset but reuse previously published hyperparameters for the other models. Which gives them an unfair advantage, because those hyperparameters were tuned on a different dataset and may not even have been optimizing for the same task.
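To make that asymmetry concrete, here's a rough sketch (the library, models, and search grid are arbitrary placeholders, not anyone's actual setup): the new model gets a hyperparameter search on the new dataset, while the baseline runs with settings copied from someone else's paper.

```python
# Sketch of the biased protocol described above (models and grids are placeholders).
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def biased_comparison(X_train, y_train, X_test, y_test):
    # "Our model": hyperparameters tuned directly against the new benchmark dataset.
    ours = GridSearchCV(
        RandomForestClassifier(),
        {"n_estimators": [50, 100, 200], "max_depth": [None, 10, 20]},
        cv=5,
    ).fit(X_train, y_train)

    # "Baseline": hyperparameters copied from a paper that tuned them on a
    # different dataset, possibly for a different task.
    baseline = LogisticRegression(C=0.01, max_iter=100).fit(X_train, y_train)

    # Unsurprisingly, "our model" tends to win. A fair comparison would give
    # the baseline the same search budget on the same data.
    return ours.score(X_test, y_test), baseline.score(X_test, y_test)
```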
It's not just over-fitting to leading benchmarks; there are also too many degrees of freedom in how a model is tested (harness, etc.). Until there's standardized documentation enabling independent replication, it's all just benchmarketing.
AI agents, perhaps? :-D
You can take off your tinfoil hat. The same model can perform differently depending on the programming language, the frameworks and libraries employed, and even the project. Also, context matters, and a model's output varies greatly depending on your prompt history.