upvote
That's a fair callout and I agree my statement was too general in just mentioning 'output', as you correctly pointed out. To define 'better' you would indeed need to agree on the dimensions you would evaluate candidates against.

I think a more appropriate rephrasing would be 'You cannot simply make a claim that (model + harness) X is better than Y, but then have no discernible difference on dimensions you care about'. In the case of latest of claude code vs codex with gpt 5.5) both are similar enough in the dimensions people will care about in evaluating (vs. differing wildly in cost or time taken).

reply
This obviously correct take will get pushback, so let me add some other examples:

- which tool required more detailed goal-setting in the prompt?

- did one tool ask follow-up questions up front vs spread out over implementation?

- did either tool match existing coding styles?

- did either tool remind you about potential conflicts between what you asked it to build and other parts of the codebase?

There are a lot of ways to compare agents besides just the code. (Similarly, working engineers are not evaluated just on their code output.)

reply
The colleague implicitly agreed that comparing the output was a valid way to settle the matter as they took part in the test, so they weren't using "better" in the way you propose.
reply