upvote
You buy a wooden dinner table, it is fully functional and looks perfect. It’s sturdy. You have dinner on it and it survives a few spills.

A few months later you find out it is made of PU foam and printed waxed paper. A misplaced knee could bring it down. It’s likely to completely fall apart in a year. Is that irrelevant?

reply
Yes it is relevant and testable. It's exactly what I meant by "a measurable increase in quality of the final product". In fact a proper test harness would reveal that problem. You are forgetting that with LLMs, testing software does not have to end at the usual unit/integration/e2e level.
reply
But how is that testable? If your test is validating the rigidity, water resistance, etc, they will all pass even if the underlying material is a bad choice. Or the glue will degrade in six months.

You can't test if a codebase will be extensible or maintainable as requirements change in the future, if the abstraction level or architecture is sound - that's down to code quality measures like the ones used here. LLMs are very good at slightly cheating to pass tests even when the implementation is wrong. Introducing subjectivity - the kind of input a human will provide - leads to improved output.

https://senior-swe-bench.snorkel.ai/blog/2026-06-16-how-it-w...

reply
That's why we should simulate changing requirements, for example with an LLM roleplaying as a human who's co-developing with an agent. Simply asking the LLM to add one big feature is not enough. I don't see why we shouldn't be able to build a more advanced benchmark. Attempting to benchmark "taste" is not the way.
reply
I'll leave the conversation at the fact that it's painfully clear that you don't write software for a living.
reply
Yes, please do leave. The thing is that this isn't even necessarily about software engineering as much as it is about benchmarking/epistemology in general.
reply