Correct output is table stakes. Your test only shows that the products work as advertised; it doesn’t reveal why people prefer one to the other.

You’re guessing that it’s a result of advertising, and I agree that that’s probably a component, but it’s a mistake to assume that they are interchangeable when you have people saying to you directly “I use both and they’re not.”

reply
This is an incredibly silly comparison. It amounts to claiming that a Ford Pinto is just as good of a car as a Rolls-Royce by simply observing that both cars got a person from point A to point B. After all, once someone reaches their destination you can hardly tell what vehicle they actually used to get there, but that doesn't mean there's no difference between vehicles.

What matters most in state-of-the-art models isn't simply the final destination; it's the process of how one arrives at that destination.

reply
I think your analogy makes the opposite case better. A Rolls-Royce and a Pinto have the same real commute time because horsepower isn't the bottleneck, and they both get passengers from point to point. Sure, the Pinto explodes a bit, but much like the actuaries at Ford, you might well judge the cost of an occasional explosion to be a trade-off you can easily compensate for.

I would argue the process these days has more to do with the harness than the model, at least when we're talking about the SOTA options. Claude Code's biggest advantage isn't Opus; rather, it's the shared knowledge the community has been building and sharing around using it effectively. Almost all of the out-of-the-box tutorials, skills, and frameworks are built for Claude first, then Codex maybe.

I'd go further and say that CC and Codex are not even the best harnesses available, they just offer the most subsidized rate plans.

reply
> Claude Code's biggest advantage isn't Opus, rather it's the shared knowledge the community has been building and sharing around using it effectively.

This. Never underestimate the ability of a large number of power users to substantially improve the actual utility of a complex software product.

They always have more time (and sometimes more skill) than a product's developers.

Sometimes the quantity of monkeys matters more than the quality of the typewriters.

reply
In my test the prompt was the same and all suggestions were auto-accepted, so there was indeed no difference other than model and harness. The number of characters typed and the interactions with the harnesses were exactly the same.
reply
To keep with the analogy, isn't that sort of like testing two cars by having them both drive the same few-hundred-foot stretch of new road at the posted speed limit of 35 MPH? You will test some things doing that, but not particularly well, and hardly all the things people find interesting and useful for comparing the performance of cars.

To bring this back to the discussion at hand (and to be redundant, as it's been mentioned here already), there are many aspects of using an LLM that are not purely about the output from one or a few well-formed prompts. Additionally, if the end results are very similar, these other aspects will have an outsized influence on people's perspective of the tools, as they're the only differences that make one model worth choosing over another.

reply
If I were to give one carpenter a set of fine hand tools, another a full workshop with power tools, and they both made a picnic table to the same spec, and at the end I wasn't able to tell which came from which, would you say I have come to a fair metric for which type of tooling to use for woodworking?
reply
If the effort was the same, as it was in my test, yes.
reply
The amount of effort is an absolutely critical comparison, right? That's been left out, yet you keep on harping about how the outputs are the same, ignoring all the many many many comments that are talking about the amount and type of effort.

In fact, after seeing all these comments about the amount of effort, you redirected to calling that mere "vibes":

> Edit: i bet 99% of people here, if presented with a test where i gave 5 models but all of the results came from one, would not be able to discern this. Just vibes all the way down

Which, again, is a highly emotional way to view people trying to say that the process matters too. Calling people "vibes based" or "highly susceptible to marketing" and saying they take part in "tupperware parties" rather than evaluating their experience with tools is quite a thing to see: a complete dismissal of professionals' core experience as "vibes" rather than something intrinsic to how they perform labor.

reply
Wouldn't the question be if they could tell the tables apart by quality (after insisting one of the two parties made things of superior quality)?
reply
To add context, the experiment you're describing is called a *blind judging test*. Remove the branding and labels, let judges sample the results, and see whether they can rank them correctly.

Blind wine tastings are one example. There are instances where journalists invited renowned, established wine tasters and subjected them to blind tasting tests. It turned out the judges couldn't tell which was which. Pretty embarrassing.

It speaks volumes about how accurately people can judge the value of things. There is research by some network scientist that says you generally can't tell the top 1% from the rest of the top, though you can tell the really bad from the generally good. What OP's experiment might tell us is that the LLM competitive advantage is so small that no one can tell which is objectively better.

reply