For a developer using an LLM on a daily basis, the experience is about much more than just the resultant code.
There’s everything from:
- how often you had to manually steer the model
- how frequently you needed to course-correct
- how much detail you had to provide up front
- what the interaction itself was like (sycophantic, etc.)
- how well did it handle MCP and external tooling?
- how effectively could it pull in additional information from external sources such as the web?
- how fast did it produce code?
- how much did it cost?
Many of my friends who are devs use things like OpenCode CLI with OpenRouter because they switch between the various SOTA models so often. Just because you saw a Claude "meetup" doesn't prove anything other than that somebody chose the name because it resonated more than "Generic LLM Meetup".
I flip between models all the time. Makes little difference. Sometimes one model is faster or better than another but there's no rhyme or reason why.
Actually there is a nice body of work by Steven Clarke on cognitive dimensions of notations/APIs and the interaction with developer personalities.
I wonder if the same holds for AI models and harnesses.
Surely this is just due to the random nature of these stochastic parrots?
Do you mean you have identified a class of problems Claude always stalls on and another class of problems Codex always stalls on? What identifies these different classes of problems you see? How would you say Claude is stronger than Codex and vice versa? Why?
You can go back and forth and compare since you pay for both subscriptions, but is that the usual case? I'd guess most developers picked one in 2025 and haven't gone back, the same way most people pick a bank for their checking account and never change it.
As for the test, of course the output matters. Take image models for example. Differences are clear as day.
Should the fact that OpenAI existed before Anthropic matter at all? No, imo. I would have used Opus 4.8, but it only just came out. It's a fast-moving space.
You’re guessing that it’s a result of advertising, and I agree that that’s probably a component, but it’s a mistake to assume that they are interchangeable when you have people saying to you directly “I use both and they’re not.”
What matters most in state-of-the-art models isn't simply the final destination, it's the process of how one arrives at that destination.
I would argue the process these days has more to do with the harness than the model, at least when we're talking about the SOTA options. Claude Code's biggest advantage isn't Opus; rather, it's the shared knowledge the community has been building and sharing around using it effectively. Almost all of the out-of-the-box tutorials and skills and frameworks are built for Claude first, then Codex maybe.
I'd go further and say that CC and Codex are not even the best harnesses available, they just offer the most subsidized rate plans.
This. Never underestimate the ability of a large number of power users to substantially improve the actual utility of a complex software product.
They always have more time (and sometimes more skill) than a product's developers.
Sometimes the quantity of monkeys matters more than the quality of the typewriters.
In fact, after seeing all these comments about the amount of effort, you pivoted to calling that mere "vibes":
> Edit: i bet 99% of people here, if presented with a test where i gave 5 models but all of the results came from one, would not be able to discern this. Just vibes all the way down
Which, again, is a highly emotional way to view people trying to say that the process matters too. Calling people "vibes based" or "highly susceptible to marketing" and saying they take part in "tupperware parties" rather than evaluating their experience with tools is quite a thing to see: a complete dismissal of professionals' core experience as "vibes" rather than something intrinsic to how they perform labor.
Some examples are blind wine tasting tests. There are instances whereby some journalists invited renowned/established wine tasters and subjected them to blind wine tasting tests. Turns out the judges couldn't tell which was which. Pretty embarrassing.
It speaks volumes as to how accurately people can judge the value of things. There is research by some network scientist that says you generally can't tell the 1% from the top, though you can tell the really bad from the generally good. What OP's experiment might tell us is that the LLM competitive advantage is so small no one can tell which is objectively better.
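For what it's worth, a blind test like this is easy to score. Here is a rough sketch, not anyone's actual methodology: it assumes each rater guesses which of N models produced an output, and asks how likely their hit count would be under pure guessing. The function name and the numbers are made up for illustration.

```python
from math import comb


def p_at_least(correct: int, trials: int, chance: float) -> float:
    """Probability of getting `correct` or more guesses right out of
    `trials` if every guess is random with success rate `chance`
    (exact binomial upper tail)."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


if __name__ == "__main__":
    # Hypothetical setup: 5 candidate models, so a random guess
    # is right 1 time in 5.
    chance = 1 / 5
    trials = 10
    for correct in (2, 4, 6, 8):
        p = p_at_least(correct, trials, chance)
        print(f"{correct}/{trials} correct: p = {p:.4f} under pure guessing")
```

A small p means the rater is probably picking up on something real; a large one means their score looks like guessing. It says nothing about which model is better, only about whether people can tell the outputs apart.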
It’s been a known “secret” for a while now how much better Codex is than Claude. I’ve used both since they were released, and I often implement in both to compare; 95% of the time Codex writes better code, and less code too!
Claude is only really better at front end design.