upvote
Hmm in my experience (I've done a lot of head-to-heads), Opus 4.6 is a weaker reviewer than GPT 5.4 xhigh. 5.4 xhigh gives very deep, very high-signal reviews and catches serious bugs much more reliably. I think it's possible you're observing Opus 4.6's higher baseline acceptance rate instead of GPT 5.4's higher implementation quality bar.
reply
This is also my experience using both via Augment Code. Never understood what my colleagues see in Claude Opus, GPT plans/deep dives are miles ahead of what Opus produces - code comprehension, code architecture is unmatched really. I do use Sonnet for implementation/iteration speed after seeding context with GPT.
reply
I agree. Opus, forget the plan mode - even when using superpowers skill, leaves a lot of stuff dangling after so many review rounds.

Along with claude max, I have a chatgpt pro plan and I find it a life-saver to catch all the silliness opus spits out.

reply
I agree, I use codex 5.4 xhigh as my reviewer and it catches major issues with Opus 4.6 implementation plans. I'm pretty close to switching to codex because of how inconsistent claude code has become.
reply
Maybe it's all just anecdotal then. Everyone is having different experiences.

Maybe we're being A/B tested.

reply
The experience one has with this stuff is heavily influenced by overall load and uptime of Anthopic's inference infra itself. The publicly reported availability of the service is one 9, that says nothing of QoS SLO numbers, which I would guess are lower. It is impossible to have a consistent CX under these conditions.
reply
I have noticed this as well. I frequently have to tell it that we need to do the correct fix (and then describe it in detail) rather than the simple fix. And even then it continues trying to revert to the simple (and often incorrect) fix.
reply
You have to throw the context away at that point. I've experienced the same thing and I found that even when I apparently talk Claude into the better version it will silently include as many aspects of the quick fix as it thinks it can get away with.
reply
I have a similar workflow but I disagree with Codex/GPT-5.4 reviews being very useful. For example, in a lot of cases they suggest over-engineering by handling edge cases that won't realistically happen.
reply