You have correctly identified that getting a "high-quality harness (ie preloaded instructions from md files, including custom skills)" is the (or at least a) hard part.

That is because you have to adapt the harness to your own problem space before you can call it high-quality.

Many people will stop that discussion at the Claude Code vs. Codex vs. OpenCode level and then mix it up with a discussion of model performance.

And that is also why "Generate an SVG of a pelican riding a bicycle" is still a benchmark worth discussing: at least it is a defined problem space.

reply
You will immediately notice the difference if you use it at the threshold.

It's like most people watching a 'starting NBA player' (not a superstar, just a starter) vs. one who sits on the bench.

If you were just watching them play, work out, shoot - you'd never notice the difference.

Put them head to head and it's 98-54 and you start to see the patterns.

It's pretty interesting actually. Someone tell me what the 'science' for this is; I'm sure there is some kind of information theory at work here.

Software has innumerable kinds of problems at varying levels of complexity, so it provides the perfect testbed for seeing how far models can go in practice.

Should add: you're very right to hint that the harness, tooling, and models tuned to both the harness and the kinds of things people do on the harness, as well as some other things, do make an enormous difference.

By and large, SOTA Codex/Claude Code are substantially better - at least for now. That may change.

reply
Head to head is interesting. I had not tried 2 agents on the same task simultaneously with 2 models.
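
For what it's worth, a head-to-head comparison like this could be tallied with something as small as the sketch below. It assumes you already have a pass/fail result per task for each model. The function name `head_to_head`, the task names, and the results are all made up for illustration, not taken from any real run.

```python
from collections import Counter


def head_to_head(results_a: dict[str, bool], results_b: dict[str, bool]) -> Counter:
    """Tally per-task outcomes for two models run on the same tasks."""
    tally = Counter()
    # Only compare tasks that both models actually attempted.
    for task in results_a.keys() & results_b.keys():
        a, b = results_a[task], results_b[task]
        if a and not b:
            tally["a_only"] += 1
        elif b and not a:
            tally["b_only"] += 1
        elif a and b:
            tally["both"] += 1
        else:
            tally["neither"] += 1
    return tally


# Hypothetical results, invented purely for illustration.
model_a = {"fix-flaky-test": True, "refactor-parser": True, "add-retry": False}
model_b = {"fix-flaky-test": True, "refactor-parser": False, "add-retry": False}

tally = head_to_head(model_a, model_b)
print(dict(tally))
```

The interesting buckets are `a_only` and `b_only`: tasks both models solve or both fail tell you nothing about the gap between them.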
reply
No, you don't.

Everything you say is just vibes: what you want to see, your own subjective experience.

reply
To an extent. I've had GPT 5.5 solve problems that Opus 4.7 struggled with, using an identical AGENTS.md/CLAUDE.md and no skills.
reply
The difference is very noticeable as your codebase gets bigger and you give higher and higher level tasks.

I've certainly had things that Opus fixed using some kind of workaround that GPT-5.5 actually solved.

And the difference between the Sonnet/Gemini/DeepSeek tier and the Opus/GPT-5.5 tier is immediately obvious.

reply
No, you're not wrong. Many people will see what you see. Enthusiasts will see squeezing out that last drop of performance as monumental, and I think it is okay for them to feel that way. I'm just satisfied with getting a tool as an aid.

Personal opinion: we need to focus more on efficiency instead of how large or complex a model can get, since a bigger model creeps into higher resource requirements. If the goal is something that costs a billion dollars to operate, then we've really lost the idea of what models are supposed to be achieving.

reply
By definition, the differences between the "best models" are small. It's a tautology. If a model is significantly dumber than the others, then it's not one of the best models.
reply