> A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, "this is where the light is"
All of your suggestions are better but they're hard, so someone casually evaluating an AI isn't going to do them.
And so on and so forth. Again, I'm not saying this is impossible but I am saying that if you tried to do it, and you got the money, and you built the test, and got the human subjects clearance, and you ignored that during the process of all that at least one more frontier model would come out, you can count on HN anklebiting your "rigorous" study even so, and probably being correct about a lot of the issues it could have because it would take several iterations of this to build a reasonable protocol... at which point it would quite possibly also be obsoleted by progress again.
[1] https://blog.neurips.cc/2025/09/30/reflecting-on-the-2025-re...
There are far more opportunities that can be served when the world's intellectuals have the raw weights and can fine tune, splice, distill, and reapply.
Imagine having raw unfettered access to Fable. It can be refit to structural biology. It can be fine tuned on the repo for smaller context requirements. It can be run cheaper and air gapped.
The world wants this.
I think we are leaving the mainframe era of AI and entering the PC era already. If there wasn’t a RAM shortage and we all had 2TB of RAM and GPUs, we would all have large local models or personal APIs serving our teams.
That’s why all the labs are moving to the App layer and moving away from being the API for intelligence like they were originally.
That said, maybe we just disagree on how to drive change, and that’s fine. I’ll leave it.
Compare that to Gemini models, which have impressive fluid intelligence on the first response, but fail to call tools or explore correctly, which limits their usefulness for agentic coding.
Neither will be great for coding in a computational chemistry repo for different reasons, but the model with strong one-shot performance will be less likely to make subtle errors indicative of poor understanding, so we weight both capabilities into their final score.
The latest Anthropic and OpenAI models excel in both domains.
Data at https://gertlabs.com/rankings
It's the "starting from empty slate" greenfield that's the real problem.
We used to make fun of engineers who follow a README on a framework, test it on an empty project, and say "this framework is the best for our 10 year running production app". Greenfield mentality is always the solution to all problems and problem to all solutions.
One should still measure oneshotting, it's an important self-measurement metric - but against an established, large codebase.
That issue, and the issue of "aesthetics", are the biggest complaints I have today. I don't know exactly how to define aesthetics, but it's when AI is making decisions that no experienced developer or designer would. They may be functionally correct but "ugly" to another developer or an end user.
An example is a case I ran into yesterday: the code was parsing a config, failing on a configuration error, and logging it. It logged the specific item where the config was invalid, but not which group it belonged to or any notion of where in the config the error was. Of course, specific item names could be duplicated in different parts of the config. It's small, but correcting these minor things takes time, and they are the kinds of decisions no one who had any experience writing code and debugging a config problem would have made. This was Opus 4.8/max too.
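To make the complaint concrete, here is a rough sketch of what "log where in the config the error is" could look like. Everything here is hypothetical (the `ConfigError` type, the `validate_ports` helper, the `services`/`port` layout); it is not the code from that session, just one way to carry the full path along while walking a nested config.

```python
from dataclasses import dataclass


@dataclass
class ConfigError:
    path: str      # full path to the offending item, e.g. "services.cache.port"
    message: str


def validate_ports(node, path=""):
    """Walk a nested config and collect port errors with the full path to each one."""
    errors = []
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = f"{path}.{key}" if path else str(key)
            if key == "port":
                # bool is a subclass of int in Python, so exclude it explicitly
                is_int = isinstance(value, int) and not isinstance(value, bool)
                if not (is_int and 0 < value < 65536):
                    errors.append(ConfigError(child_path, f"invalid port {value!r}"))
            else:
                errors.extend(validate_ports(value, child_path))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            errors.extend(validate_ports(item, f"{path}[{index}]"))
    return errors


# Hypothetical config: the item name "port" appears in two groups
config = {
    "services": {
        "db": {"host": "localhost", "port": 5432},
        "cache": {"host": "localhost", "port": "oops"},
    }
}

for err in validate_ports(config):
    print(f"config error at {err.path}: {err.message}")
```

With the path in the message you can tell the `cache` port from the `db` port, which is the whole point when item names repeat.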
* SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios https://arxiv.org/abs/2512.18470
* SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration https://arxiv.org/abs/2603.03823
Note that after the model generated a bunch of (intermediary) code, they still have to have it tested and get bugs fixed (via the agent/harness). In this "one shot" you still have agent loops against human defined objectives.
And these toy examples give some insight as to how the model performs. If the test were "here's some code written by $corp, please take these tickets and work on them" it may be a "real" example but nobody would be able to make sense of actually how "hard" it is, or how "well" the model did the job, besides the workers already familiar with the context.
At least everyone knows what a 3D game is.
I think however that they should have used the same harness and also repeated the experiment a few times to judge the variance in results.
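For what it's worth, judging variance doesn't need much machinery. A minimal sketch, assuming each run of a model gets a single numeric score from the same harness; the model names and scores below are made-up placeholders, not results from any benchmark.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) for repeated runs of one model."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder scores, invented purely for illustration
runs = {
    "model_a": [7.0, 8.0, 9.0],
    "model_b": [5.0, 8.0, 11.0],
}

for name, scores in runs.items():
    mean, spread = summarize_runs(scores)
    print(f"{name}: mean={mean:.2f} stdev={spread:.2f} over {len(scores)} runs")
```

In this toy data both models have the same mean but very different spread, which is exactly what a single run per model would hide.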
What about: take top 3 feature requests, top 3 bug reports for 3 popular open source projects and ask to solve those based on the issue contents and access to the project repos.
Even if you stay in a single prompt scenario, you could make it more realistic.
Right, model intelligence defines the scope of things they can one shot
I also suspect that users naturally calibrate to a model's useful scope, gradually getting positive/negative feedback and gradually making their requests bigger/smaller than before
Similar to how ML was all the hype about 12 years ago and then it submerged again for a couple of years.
One can hope. Probably an unpopular take here but I'm tired boss.
The software world has a huge backlog of things that can all be done with the tech we currently have, no breakthrough advancements needed, but none of it will get prioritized when we're all forced to run on the new and shiny treadmill. Ever since the LLM hype, it's like the JavaScript culture of a new framework every 10 minutes has infected every other vertical of software development, and I'm exhausted.
But for a more practical issue, the ultimate goal of LLMs is to replace software engineers, or at least enable everybody to become a software engineer, to use a more up-beat phrasing that's no less accurate. And so an LLM's ability to reliably construct something from a poorly defined, contradictory, or otherwise flawed prompt, while accurately inferring intent is probably the first finish line.
- Vibes are too subjective, I want an actual A/B test!
- An A/B test is too limited, I want a benchmark! (You are here.)
- Those benchmarks never seem to be reliable, I just go on vibes.
In fact, I'd rather see Anthropic publish a convincing project that does this using Claude. The project should be complex enough and novel enough to show the world how reliable and powerful Claude is. That is, Anthropic does not need Amodei or its employees to tell us that whatever percent of engineers will lose their jobs. They can just show us. Easily.
I was using Cursor, in large part because I could at least stop it when I needed to.
I ended up building my own IDE from scratch so I can be more in the loop while also having the full agent experience.
Guardrails/conventions should be enforced in linters, formatters, static analysis tooling; not specs/prompts.
Elixir is where I prefer to build software, so it would be creating a custom Credo rule.
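I don't have a Credo rule handy to paste, so here is the same idea sketched in Python with the standard `ast` module instead. The convention being enforced ("no bare `print()` calls, use a logger") and the `find_print_calls` helper are made up for illustration; the point is only that the rule lives in tooling that fails the build, not in a prompt.

```python
import ast


def find_print_calls(source):
    """Return sorted (line, column) positions of bare print() calls in the source."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        is_call = isinstance(node, ast.Call)
        if is_call and isinstance(node.func, ast.Name) and node.func.id == "print":
            hits.append((node.lineno, node.col_offset))
    return sorted(hits)


# Hypothetical snippet an agent might have produced
sample = "def handler(x):\n    print(x)\n    return x\n"

for line, col in find_print_calls(sample):
    print(f"line {line}, col {col}: use the logger instead of print()")
```

Wire something like this into CI and the agent gets the same feedback a human would, every time, regardless of what the prompt said.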
https://developers.openai.com/api/docs/guides/structured-out...
Nothing else operates on the logprobs level and literally bans continuations that fail your schema.
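For anyone unfamiliar with the idea, here is a toy sketch of what "banning continuations" means in general. This is not how any particular vendor implements it; the vocabulary, the logits, and the digits-then-`END` "schema" are all invented for illustration. At each step, tokens the schema disallows get a score of negative infinity, so they can never be picked.

```python
import math


def constrained_pick(logits, allowed):
    """Pick the highest-scoring token among those the schema allows at this step."""
    masked = {
        tok: (score if tok in allowed else -math.inf)
        for tok, score in logits.items()
    }
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("no allowed token available at this step")
    return best


def allowed_tokens(generated):
    """Toy schema: one or more digits, then END."""
    digits = set("0123456789")
    return digits if not generated else digits | {"END"}


# Made-up per-step logits; the unconstrained favorite is always a disallowed token
steps = [
    {"hello": 3.0, "4": 1.0, "END": 2.0},
    {"world": 2.5, "2": 0.5, "END": 0.1},
    {"!": 4.0, "7": -1.0, "END": 0.0},
]

out = []
for logits in steps:
    tok = constrained_pick(logits, allowed_tokens(out))
    if tok == "END":
        break
    out.append(tok)

print("".join(out))
```

Note that the model's top choice is overridden at every step here, which is why this is a stronger guarantee than asking nicely in the prompt and validating afterwards.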
I agree generating millions of tokens from a handful of input tokens doesn't convey anything meaningful to me.
It's a relatively objective way of testing LLMs, and I think it's pretty representative of how strong models are overall.
The outcome of this test mirrors how GLM 5.2 and Opus 4.8 work for me: they're both similarly capable of fully executing a given task, but Opus tends to have a bit more "taste" in how it handles unstated details or implicit requirements.
> what you'll get is a series of assumptions made by the model
Yes, but that's why we use these models in the first place. We don't want to explicitly write down all the details because that would mean writing code. So we write a higher-level, human-language spec, and let the LLM fill in the blanks. The question is how good they are at doing that.
Of course, with a software engineer at the helm - the models are going to be able to be guided to produce much better output. (Or worse, depending on the engineer!)
To really evaluate how a model is to use in real life, it should have access to tools, and be able to iterate on something, like they do when you use them in an agent harness.
None of that iteration necessarily needs a human driving it (although if you're building something you want to be able to maintain, you probably need a human driving the design and architecture). You can just let the model do a couple of tries and give it input into how it's doing, and you get something closer to how people use these models in reality.
This is the wrong metric to target. Today's models can feel one-shot but they are so at the expense of resilient ReAct loops that brute force their way out of the mess initial prompts created.
And each iteration is expensive.
Sometimes failing fast and early is better than going for one-shot models that try to mitigate the mess they created with reasoning steps and ReAct loops.
Additionally, with "Hey build X" nobody is happy with the methodology and people rightfully complain about the setup.
Using your suggestion, the methodology would require a lot of presumptions & arguments regarding why you chose it and think it relevant to people.
Either people would not "get" it quickly enough or would disagree with/not be interested in the setup because it's not how they use LLMs.
On another, being able to reliably tackle minor tasks with no handholding is very valuable in itself. Sometimes implementation details are important, but often, the most important thing is to Get It Done.
Current models aren't capable of that, but that doesn't mean it's not possible.
If you made models able to code to long spec, you would be left with the hard issue of having to write them.
Like if you show the LLM a page, can the LLM review the page and then spit out a review that is close to what a human would say about the page?
Software was always that way, though.
> given the sufficiently smart compiler
For those unaware, this is a similar quote used by compiler proponents. The first full compiler was created in 1957 (+/- 70 years ago) and the "sufficiently smart compiler" never happened, hand written code from the best coders still is faster. Now, that doesn't mean that compilers didn't do the job well enough, we just accepted that 90-95% of the top speed was enough for almost everything.
To the LLM one shotting point, it took 30 (40?) years for compilers to be good enough for the mass market. Caveat early adopter and investor.
Plus what pyrale said.
The agentic engineering paradigm is just a narrative trend pushed by AI companies to get people to 10x their token consumption per prompt. It plays into people's laziness and addiction to dopamine too, causing addict-like behavior in people who fall prey to this trend.
If I do that, I'm literally slower than just doing the change without sufficiently specifying it to the model.
I can see how a junior dev or generally someone that's not particularly knowledgeable about the language or framework they're working with may benefit from such usage, but for experienced people there is very little value in that approach.
I say this because I've just had to face this decision this month, with Copilot introducing usage-based billing. I attempted to scale back my usage, first by switching to non-Opus models: output essentially became discardable, as it continually hallucinated non-existent fields in API responses etc. Then by scoping the changes smaller and smaller, until I ultimately gave up and reduced usage to just generating tests.
What is tested often makes no sense at all, completely implausible edge cases are tested on internals, while it doesn't create tests for the overall application using user events.
And some things in these test cases are downright ridiculous: instead of instantiating your classes, it sets up some barebones fake objects reimplementing some of the behavior of your actual class, then ignores the TypeScript errors via force cast or similar.
Then it proceeds to slap some test ids on the output, stubs components and dependencies more or less randomly, adds some assertions on test ids and calls it a day.
Apparently that's good enough for many colleagues to open a MR for that garbage.
That said, at home with SOTA models I happily hand large units of work to it, outsource much of the thinking, and get workable results. I think this is the future.
I see little value in throwing a ton of context at an LLM and waiting 10-20 minutes for a coin flip on whether or not it's going to produce junk. I'd rather do quick 60 second turns, get most of the way there and fix the rest myself if I have to. I'd rather honestly just not use them.
Everyone I've ever interacted with who claims to prompt in "seconds" actually needs multiple minutes to think about the solution they want the model to implement, and then needs twice as long to formulate that into a sentence which provides the model enough context to actually do that.
So the more realistic estimates are "I'd rather spend the 2 minutes just implementing the minor change myself, instead of spending 1.5 minutes thinking about it, then 2.5 minutes writing the prompt and then waiting 1 minute for it to finish"
That's the main value I've been getting out of coding agents. I have them do (comparatively) simpler tasks or explorative tasks in the background while I'm in a meeting, doing code reviews, or otherwise working on something else.
The top agent is for steering, but all subagents are mostly oneshot prompts
"Well obviously you provided better follow-up prompts to the one that came out better."
Also nothing about human-provided plan files and guardrails precludes the one-shot benchmark test. Heavens, I almost said "real coding," but in "real agentic program creation" you'd obviously be doing multi-turn interaction with the agent, but how can you provide a fair test when the model's output n determines your n+1 response?
The business guy would say "hey build me this and that" and would get _something_ to show off.
An engineer will have a long conversation with an LLM about the exact requirements, tech stack, tradeoffs. He would understand what is built, how it is built, and refine on the fly until he gets something sensible.
It won't be as fast as "build this", but the result will be much better and more maintainable.
For the engineering workflow, you don't need Fable. Any model better than or equivalent to Sonnet 4.6 would do. Yes, sometimes it will hallucinate, sometimes it'll be wrong, but it's our job as engineers to correct it and have full ownership of the result.
And yet, even the smartest AI in the world would give an alternative solution every time you invoke it. And you still need someone to judge what is right and what is not.
The reality is:
- business rules change
- ideas for improvement may arise from the initial prompt
- updates to submodules/functions/configs/secrets are BLOCKERS
... etc.
One shot prompting with the expectation of complete software is seemingly more and more a show of incompetence in the use of this technology. It's like trying to make my toddler eat a ham sandwich from the peanut butter & jelly I put in front of him.
If we stopped developing LLMs, the only reasonable way to benchmark them would be to compare their performance with all the tricks we can build on top of them. Since they are still developing rapidly, any apples-to-apples comparison is worthwhile.
Of course this particular benchmark is not really single prompt but rather "agentic without steering".
Instruction following has been down for years, and while there are of course metrics that continue to improve as the frontier advances (for example, the ability to continue following the original instructions even as context grows), you can't really get that much better at performing a list of instructions as-written if the instructions are precise enough that there's no wiggle room for interpretation (which seems to be what you are describing).
For example, one of the things that got me the most excited for Fable 5 was its ability to work for over eight hours straight on a single instruction and seemingly faithfully the entire time. That was something I observed personally after trying out the same workflow that runs for maybe two or three hours with Opus and then still needs followups. Fable needed no followups. That's a game changer for me compared to the prior state of the art.
That kind of stuff is going to end up being the most beneficial to people who are touching the edges of their knowledge or even exploring completely new areas. And that type of work is exactly the kind of work that makes agentic coding so powerful, even as much as it gets harder to judge the quality of the work when you lack the skills yourself. It's a good thing that the quality increases across the board, even for skilled practitioners.
For example, even people who know how to write inference engines or how matmul kernels work or how to optimize model architecture can't always predict just the sheer breadth of things agents can try to improve performance, and sometimes you get over some wall and reach a completely different optimum that you just wouldn't have reached in any reasonable amount of time by applying traditional knowledge even if you're an expert in the field.
That kind of stuff is amazing. And that's exactly the kind of stuff that one-shot prompting is testing for. It's kind of like testing for the model's "innovation", as much of an oxymoron as that is.
Since Opus 4.6 I've seen later Anthropic models being more and more capable on one hand, but also less useful on multi turn open tasks.
It feels like with each model they are more and more prone to go "their own way" and jump into the implementation as soon as they can.
I can't help but blame it on benchmarks and fine tuning around prompt-to-solution work.