Yup, OP is conflating so many things that the comparison has all the scientific rigor of the Pepsi Challenge.

For a developer using an LLM on a daily basis, the experience is about much more than just the resultant code.

There’s everything from:

- how often you had to manually steer the model

- how frequently you needed to course-correct

- how much detail you had to provide up front

- how was the interaction process (sycophantic, etc)

- how well did it handle MCP and external tooling?

- how effectively could it pull in additional information from external sources such as the web?

- how fast did it produce code?

- how much did it cost?

Many of my friends who are devs use things like OpenCode CLI with OpenRouter because they switch between the various SOTA models so often. Just because you saw a Claude "meetup" doesn't prove anything other than that somebody chose the name because it resonated more than "Generic LLM Meetup".

reply
The answer is: I usually let it do its thing with bypass permission and I run the Max plan, so nothing really matters except the "result". I think Claude is faster and has better UX integration with VS Code, but I wouldn't use it without GPT 5.5 XHigh as reviewer. Claude is just sloppy. Eventually, in 1-2 years, I think it will not matter much. Most AI models will be good enough for most tasks, so you may need the best of the best only if you do very complex stuff (e.g. optimizations).
reply
I just don't believe non-deterministic tools can actually be benchmarked. It's all hoopla to me.

I flip between models all the time. Makes little difference. Sometimes one model is faster or better than another but there's no rhyme or reason why.

reply
All tools are non-deterministic on some reasonably specified input set.
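
One way to make that concrete: a stochastic tool can still be benchmarked by repeating the same task many times and reporting a pass rate with an interval instead of a single run. A minimal Python sketch, assuming a Wilson score interval at roughly 95% confidence; the pass counts are made-up numbers for illustration, not measurements of any real model.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= passes <= trials:
        raise ValueError("passes must be between 0 and trials")
    p = passes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts: each model attempts the same task 40 times.
model_a = wilson_interval(passes=31, trials=40)
model_b = wilson_interval(passes=27, trials=40)

# Overlapping intervals mean this sample does not clearly separate the two.
overlap = model_a[0] <= model_b[1] and model_b[0] <= model_a[1]
print(model_a, model_b, overlap)
```

With these made-up counts the intervals overlap, which would fit the "makes little difference" experience without implying that measurement is impossible.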
reply
Actually it would be fun to try to test the developer personality of the models.

Actually there is a nice body of work by Steven Clarke on cognitive dimensions of notations/APIs and the interaction with developer personalities.

I wonder if the same holds for AI models and harnesses.

reply
> Some times one will spin for a long time on certain problems where the other has no problem finding the appropriate parts of the codebase and getting an efficient solution.

Surely this is just due to the random nature of these stochastic parrots?

Do you mean you have identified a class of problems Claude always stalls on and another class of problems Codex always stalls on? What identifies these different classes of problems you see? How would you say Claude is stronger than Codex and vice versa? Why?

reply
Kind of orthogonal to the discussion, but could you broadly describe the code you're working on that both models are bad at? One thing I'm still struggling with is figuring out what types of code LLMs can vs cannot write.
reply
C code formally proven correct with Frama-C WP has been... marginal. The models do better than I expected at the proof portion (with ChatGPT 5.5 seeming to have a meaningful lead), but they all have a hard time (a) writing really good C code to begin with and (b) with compliance around not modifying C code semantics or performance as a cheat to simplify proof obligations. They also tend to be insanely and consistently verbose on the first proof pass... e.g. 8 lines of C code might end up at 200+ lines annotated and proven, but after simplification passes end up at 40 lines. I find I spend 90%+ of tokens on those simplification passes, and haven't really found a way to avoid the over-annotate-and-then-optimize tides by being a bit more sane the first time around.
reply
I think the subscription pricing model kind of incentivizes developers (at least hobby developers) to pick one and go all in on it. For someone who has probably never paid $20/mo for a piece of software in their life, $20/mo is kind of a big commitment, and the pay-per-token schemes are reportedly much more expensive for the equivalent blob of coding they enable. So you "pick one," plonk down the $20, and use it as much as you can in the month so it's worth it. If you want to try the other one, you don't renew next month, and plonk down another $20 for the other one.

You can go back and forth and compare since you pay for both subscriptions, but is that a usual case? I'd guess most developers picked one in 2025 and haven't gone back. Just like most people just pick a bank for their checking account and never change it.

reply
I am not sure why the past matters here. I am talking about now; it is a fast-moving space.

As for the test, of course the output matters. Take image models for example. Differences are clear as day.

Should the fact that OpenAI existed before Anthropic did matter at all? No, imo. I would have used Opus 4.8, but it only just came out. It's a fast-moving space.

reply
Correct output is table stakes. Your test only shows that the products work as advertised; it doesn’t reveal reasons why people prefer one to the other.

You’re guessing that it’s a result of advertising, and I agree that that’s probably a component, but it’s a mistake to assume that they are interchangeable when you have people saying to you directly “I use both and they’re not.”

reply
This is an incredibly silly comparison. It amounts to claiming that a Ford Pinto is just as good of a car as a Rolls-Royce by simply observing that both cars got a person from point A to point B. After all, once someone reaches their destination you can hardly tell what vehicle they actually used to get there, but that doesn't mean there's no difference between vehicles.

What matters most in state-of-the-art models isn't simply the final destination, it's the process of how one arrives at that destination.

reply
I think your analogy makes the opposite case better. A Rolls-Royce and a Pinto have the same real commute time because horsepower isn't the bottleneck, and they both get passengers from point to point. Sure the Pinto explodes a bit but much like the actuaries at Ford, you might well judge the cost of an occasional explosion to be a trade-off you can easily compensate for.

I would argue the process these days has more to do with the harness than the model, at least when we're talking about the SOTA options. Claude Code's biggest advantage isn't Opus, rather it's the shared knowledge the community has been building and sharing around using it effectively. Almost all of the out-of-the-box tutorials and skills and frameworks are built for Claude first, then Codex maybe.

I'd go further and say that CC and Codex are not even the best harnesses available, they just offer the most subsidized rate plans.

reply
> Claude Code's biggest advantage isn't Opus, rather it's the shared knowledge the community has been building and sharing around using it effectively.

This. Never underestimate the ability of a large number of power users to substantially improve the actual utility of a complex software product.

They always have more time (and sometimes more skill) than a product's developers.

Sometimes the quantity of monkeys matters more than the quality of the typewriters.

reply
In my test the prompt was the same and all suggestions were auto-accepted, so indeed there was no difference other than model and harness. The number of characters typed and the interaction with the harnesses were exactly the same.
reply
If I were to give one carpenter a set of fine hand tools, another a full workshop with power tools, and they both made a picnic table to the same spec, and at the end I wasn't able to tell which came from which, would you say I have come to a fair metric for which type of tooling to use for wood working?
reply
If the effort was the same as was in my test, yes.
reply
The amount of effort is an absolutely critical comparison, right? That's been left out, yet you keep on harping about how the outputs are the same, ignoring all the many many many comments that are talking about the amount and type of effort.

In fact, after seeing all these comments about the amount of effort, you redirected to calling that mere "vibes":

> Edit: i bet 99% of people here, if presented with a test where i gave 5 models but all of the results came from one, would not be able to discern this. Just vibes all the way down

Which, again, is a highly emotional way to view people trying to say that the process matters too. Calling people "vibes based" or "highly susceptible to marketing" and saying they take part in "tupperware parties" rather than evaluating their experience with tools is quite a thing to see, a complete dismissal of professionals' core experience as "vibes" rather than something intrinsic to how they perform labor.

reply
Wouldn't the question be if they could tell the tables apart by quality (after insisting one of the two parties made things of superior quality)?
reply
To add on context, the experiment you're giving is called a *blind judging test*. Remove the branding and labels, and let judges sample the results and see if they can tell which is ranked correctly.

Some examples are blind wine tasting tests. There are instances whereby some journalists invited renowned/established wine tasters and subjected them to blind wine tasting tests. Turns out the judges couldn't tell which was which. Pretty embarrassing.

It speaks volumes as to how accurately people can judge the value of things. There is research by some network scientist that says you generally can't tell the 1% from the top, though you can tell the really bad from the generally good. What OP's experiment might tell us is that the LLM competitive advantage is so small no one can tell which is objectively better.
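
A blind judging test like this can be scored with an exact one-sided binomial test: under the null hypothesis that the judge is guessing, how likely is it to get at least this many right? A minimal Python sketch, assuming a two-way forced choice (chance = 0.5); the 13-out-of-20 figure is a hypothetical example, not data from OP's experiment.

```python
from math import comb


def p_value_at_least(correct: int, trials: int, chance: float) -> float:
    """P(X >= correct) for X ~ Binomial(trials, chance): exact one-sided test."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    if not 0.0 <= chance <= 1.0:
        raise ValueError("chance must be a probability")
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical: a judge sees 20 unlabeled output pairs and names the model
# behind each; pure guessing would be right half the time.
p = p_value_at_least(correct=13, trials=20, chance=0.5)
print(f"p-value: {p:.4f}")  # about 0.13, not significant at the usual 0.05 level
```

Note that a large p-value only means this sample failed to show the judge can discriminate. It does not prove the outputs are equivalent, and it says nothing about effort or process.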

reply
Exactly. Popular opinion is behind reality by several months. Claude used to be significantly better, now it is basically the same.
reply
Claude has been behind since GPT-5. Claude Code just looked cool and had better marketing
reply
Claude is more reliable in production, makes fewer errors, and understands instructions better. That's why the valuation shifted: technical people are choosing Claude for actual shipped products.
reply
You must not be talking about the Anthropic endpoint…
reply
This isn’t the case at all, the most technical and best engineers are all using Codex now and have been for roughly six months.

It’s been a known “secret” for a while now how much better Codex is than Claude. I’ve used both since they were released and I often implement in both to compare, and 95% of the time Codex writes better code and also less code!

Claude is only really better at front end design.

reply
How could you possibly know all the most "technical and best engineers"? Wait... are you a Codex instance?
reply
You're being silly. The actual technical people are using Claude for implementation and relying on MCP servers to use Codex 5.5 and Gemini 3.1 pro to build teams, councils, and long running senior engineer conversations within Claude to handle the technical bits that're too complicated for Claude.
reply