undefined

points

[-]

In our evals for answering cybersecurity incident investigation questions and even autonomously doing the full investigation, gpt-5.2-codex with low reasoning was the clear winner over non-codex or higher reasoning. 2X+ faster, higher completion rates, etc.

It was generally smarter than pre-5.2 so strategically better, and codex likewise wrote better database queries than non-codex, and as it needs to iteratively hunt down the answer, didn't run out the clock by drowning in reasoning.

Video: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t...

We'll be updating numbers on 5.3 and claude, but basically same thing there. Early, but we were surprised to see codex outperform opus here.

by jeswin5 hours ago|

prev|

[-]

When it comes to lengthy non-trivial work, codex is much better but also slower.

by 5 hours ago|

prev|

[-]

deleted

by surgical_fire6 hours ago|

prev|

[-]

I've been using Codex for software development personally (I have a ChatGPT account), and I use Claude at work (since it is provided by my employer).

I find both Codex and Claude Opus perform at a similar level, and in some ways I actually prefer Codex (I keep hitting quota limits in Opus and have to revert back to Sonnet).

If your question is related to morality (the thing about US politics, DoD contract and so on)... I am not from the US, and I don't care about its internal politics. I also think both OpenAI and Anthropic are evil, and the world would be better if neither existed.

by hnsr3 hours ago|

parent|

[-]

> I've been using Codex for software development personally (I have a ChatGPT account), and I use Claude at work (since it is provided by my employer).

Exact same situation here. I've been using both extensively for the last month or so, but still don't really feel either of them is much better or worse. But I have not done large complex features with it yet, mostly just iterative work or small features.

I also feel I am probably being very (overly?) specific in my prompts compared to how other people around me use these agents, so maybe that 'masks' things

by joquarky2 hours ago|

parent|

[-]

> overly specific

I have a hypothesis that people who have patience and reasonably well-developed written language skills will scratch their heads at why everyone else is having so much difficulty.

by simianwords6 hours ago|

parent|

prev|

[-]

No my question was why would I use codex over gpt 5.4

by surgical_fire6 hours ago|

parent|

[-]

Ahh, good question. I misunderstood you, apologies.

There's no mention of pricing, quotas and so on. Perhaps Codex will still be preferable for coding tasks as it is tailored for it? Maybe it is faster to respond?

Just speculation on my part. If it becomes redundant to 5.4, I presume it will be sunset. Or maybe they eventually release a Codex 5.4?

by landtuna5 hours ago|

parent|

[-]

5.3 Codex is $1.75/$14, and 5.4 is $2.50/$15.

by surgical_fire3 hours ago|

parent|

[-]

There you go. It makes perfect sense to keep it around then.

by athrowaway3z5 hours ago|

parent|

prev|

[-]

They perform at a somewhat equal level on writing single files. But Codex is absolute garbage at theory of self/others. That quickly becomes frustrating.

I can tell claude to spawn a new coding agent, and it will understand what that is, what it should be told, and what it can approximately do.

Codex on the other hand will spawn an agent and then tell it to continue with the work. It knows a coding agent can do work, but doesn't know how you'd use it - or that it won't magically know a plan.

You could add more scaffolding to fix this, but Claude proves you shouldn't have to.

I suspect this is a deeper model "intelligence" difference between the two, but I hope 5.4 will surprise me.

by surgical_fire4 hours ago|

parent|

[-]

> They perform at a somewhat equal level on writing single files.

That's not the experience I have. I had it do more complex changes spawning multiple files and it performed well.

I don't like using multiple agents though. I don't vibe code, I actually review every change it makes. The bottleneck is my review bandwidth, more agents producing more code will not speed me up (in fact it will slow me down, as I'll need to context switch more often).

by synergy204 hours ago|

prev|

[-]

in my testing codex actually planned worse than claude but coded better once the plan is set, and faster. it is also excellent to cross check claude's work, always finding great weakness each time.

by pmarreck3 hours ago|

parent|

[-]

That’s why I think the sweet spot is to write up plans with Claude and then execute them with Codex

by GorbachevyChase2 hours ago|

parent|

[-]

Weird. It used to be the opposite. My own experience is that Claude’s behind-the-scenes support is a differentiator for supporting office work. It handles documents, spreadsheets and such much better than anyone else (presumably with server side scripts). Codex feels a bit smarter, but it inserts a lot of checkpoints to keep from running too long. Claude will run a plan to the end, but the token limits have become so small in the last couple months that the $20 pla basically only buys one significant task per day. The iOS app is what makes me keep the subscription.

by joquarky2 hours ago|

parent|

prev|

[-]

And it fits well with the $20 plans for each since Codex seems to provide about 7-8x more usage than Claude.

by embedding-shape6 hours ago|

prev|

[-]

Why would someone use Claude Code instead? Or any other harness? Or why only use one?

My own tooling throws off requests to multiple agents at the same time, then I compare which one is best, and continue from there. Most of the time Codex ends up with the best end results though, but my hunch is that at one point that'll change, hence I continue using multiple at the same time.