One major weakness of this study is that the authors didn’t fully test frontier models, for cost reasons, so the specific performance results should be taken with a grain of salt. But the overall conclusion that models degrade when both behavior and architecture must be correct is interesting, and something to keep an eye on.
For example, if you apply "guardrails" to an image generator from about a year ago, all the people start looking alike. Story generators start using only a few standard names.
That was last year. Is it happening with the frontier models?
If you only have functional requirements, then in effect you're doing some form of program synthesis, and RL can optimize that very hard.
If you have a mixture of functional and non-functional requirements, you are basically giving the model an incomplete specification, and it must in some way guess at the user's intent to fill in the blanks. This is also why adding examples of the style of code you want to the prompt (hats off to antirez for this particular tip ;)) is phenomenally powerful.
To put it in practice: if you point claude/codex to a repository and you ask it to implement feature X using style guide Y, the code will probably work, but you can usually get better results by saying "do it in the style of this file, it was done well there".
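A minimal sketch of what assembling that kind of prompt could look like. The function name, arguments, file path, and prompt wording are all made up for illustration; this is not a claude/codex API, it only builds a string.

```python
def build_style_prompt(task: str, exemplar_name: str, exemplar_source: str,
                       max_chars: int = 4000) -> str:
    """Build a prompt that points the model at a concrete file to imitate.

    Hypothetical helper: it only formats text and does not call any model.
    """
    # Truncate the exemplar so it cannot crowd the task out of the context.
    sample = exemplar_source[:max_chars]
    return (
        f"Task: {task}\n\n"
        f"Do it in the style of {exemplar_name}; it was done well there.\n"
        f"Match its structure, naming, and error handling.\n\n"
        f"--- {exemplar_name} ---\n"
        f"{sample}\n"
    )


prompt = build_style_prompt(
    task="implement feature X",
    exemplar_name="reports/existing_report.py",  # placeholder path
    exemplar_source="def render(rows):\n    return [format_row(r) for r in rows]\n",
)
```

The point of passing the file itself rather than a style guide is that the exemplar fills in the non-functional blanks by demonstration.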
It is not great at decision making or judgment calls that don't have a well defined spec or plan in place yet; like unofficial or unapproved tokens if you will. A lot of this stuff simply never has had specs as it has been internal to how companies work and their secret sauce.
The closest thing we have is governance and compliance policy, which legal and business needs force us to write down, so it's far better documented than the operational side of how we work. I guess what I'm saying is that it's more about the how than the what.
But yeah this is why it does great when there are tests, design systems, evals, and other artifacts to mirror. Far more reckless and unpredictable without these things, but still great for exploration and finding the data output you seek.
Ended up pointing Claude at a few sample files from our existing reporting, gave it read-only oauth access to the ERP and said “build a new report showing the cash by project as calculated by xxxx - yyyy + zzzz in the style of the existing reports” and it basically one-shot from there.
Kind of crazy. I built a bunch of redundant checksums because I honestly didn’t think it would be able to replace like 6 workdays of effort for the 2 FTEs who generate that kind of thing manually every month, but so far so good.
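For anyone wondering what a redundant check like that might look like, here is a rough sketch. The columns x, y, z are placeholders standing in for the xxxx, yyyy, zzzz terms above, and the function names and sample data are invented for illustration; the real ERP fields and checks are not shown here.

```python
from decimal import Decimal

ZERO = Decimal("0")


def cash_by_project(rows):
    """rows: list of (project, x, y, z) Decimal amounts -> {project: x - y + z}."""
    totals = {}
    for project, x, y, z in rows:
        totals[project] = totals.get(project, ZERO) + (x - y + z)
    return totals


def grand_total_matches(rows, report):
    """Recompute the grand total column by column, independently of the
    per-project grouping, and compare it with the sum of the report."""
    sum_x = sum((r[1] for r in rows), ZERO)
    sum_y = sum((r[2] for r in rows), ZERO)
    sum_z = sum((r[3] for r in rows), ZERO)
    return sum_x - sum_y + sum_z == sum(report.values(), ZERO)


# Placeholder data, not real figures.
rows = [
    ("alpha", Decimal("100.00"), Decimal("30.00"), Decimal("5.50")),
    ("alpha", Decimal("20.00"), Decimal("0.00"), Decimal("0.00")),
    ("beta", Decimal("50.00"), Decimal("60.00"), Decimal("10.00")),
]
report = cash_by_project(rows)
ok = grand_total_matches(rows, report)
```

Decimal is used instead of float so the equality comparison is exact. A check like this only catches grouping and arithmetic slips; it cannot tell you whether the formula itself is the right one.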
I mean, I spend more tokens having them clean up all the places they didn't follow the plan (if I catch it) or implementing what came out of a 'complete and tested' previous plan, where they just stop as soon as all the pathetic new tests pass and you discover half of it isn't even there when you try to implement the next thing on top of it.
Though... I have been conducting an experiment, of sorts, where we've been cooking on these fairly complicated projects and I don't ever touch a single line of code, just yell at them a lot, and with suitable amounts of marijuana (they are very frustrating most of the time) it's been going pretty well. It also helps that they need to explain what they're doing to somebody fairly-baked -- maybe not such an HR friendly plan?
I’m not really interested in analysis of the weaknesses of such models because in my experience many weaknesses disappear entirely as models get stronger and reasoning effort is turned up. Especially if you tell them what you want them to do.
Also, it’s not surprising to learn that when more acceptance criteria are added the failure rate increases.
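A back-of-the-envelope version of why that is expected, assuming each criterion is met independently and with the same probability. That is a simplifying assumption made here for illustration, not something taken from the paper.

```python
def all_criteria_pass_rate(per_criterion_rate: float, n_criteria: int) -> float:
    """Probability that every criterion is met, if each one is met
    independently with probability per_criterion_rate (an assumption)."""
    return per_criterion_rate ** n_criteria


# With a 95% chance per criterion, one criterion passes 95% of the time,
# while ten criteria all pass only about 60% of the time (0.95 ** 10 ≈ 0.599).
one = all_criteria_pass_rate(0.95, 1)
ten = all_criteria_pass_rate(0.95, 10)
```

Real criteria are correlated, so the actual curve will differ, but the direction is the same: each added requirement can only lower the rate at which all of them hold.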
Even the best models have trouble adhering to stuff as mundane as rules for how to style generated code (indent this much, name things with these patterns, etc.). Even the most die-hard AI-first coder will admit to that kind of stuff being not unheard-of. Yet they still delude themselves into thinking that these models will follow a sufficiently detailed spec to the letter, every time.
[1]: https://medium.com/@vishvananda/i-spent-2-billion-tokens-wri...
I've only read the abstract of this one so far but it seems like this paper has zoomed in on programming with greater fidelity and shown a similar phenomenon. But not about long horizon tasks, more like "long style horizons" of larger sets of structural constraints.
[1] https://arxiv.org/abs/2604.15597
Discussion: https://news.ycombinator.com/item?id=48073246
If there's a second thing the generative AI tools have shown beyond any doubt it's that many of the more modern (relatively speaking) "best practices" that have always been over-hyped and questionably-evidenced really do tend to produce worse results. LLMs take these methods to their logical conclusions and show us the end result much sooner. You can't just iterate your way to a solution when you don't even know what problem you're trying to solve. If you don't have a clear spec then you don't know what a correct product looks like. You need to invest time in reviewing code properly. If you don't keep the big picture in mind then the big picture becomes a mess.
Maybe one day the LLMs will leave me out of a job but at least I'll feel validated first!
tasks spanning eight web frameworks
Does anyone else have the experience that LLMs create better pure html+CSS+js than code that works with existing frameworks? The most incredible combo I've seen lately is progressive enhancement of Razor Pages with javascript. With this arrangement the newest models tend to make a really good call on whether something should happen server-side (cshtml) or on the client (js).
When using Codex/Claude Code with Go code I cannot count the times the agent makes some change, runs a build to check for errors, finds some, and fixes them.
https://docs.python.org/3/library/typing.html
"The Python runtime does not enforce function and variable type annotations. They can be used by third party tools such as type checkers, IDEs, linters, etc."
Which third-party enforcement mechanism do you propose become the default?
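To make the quoted sentence concrete, a tiny example: the annotation says int, the call passes a str, and the interpreter runs it anyway. A static checker such as mypy would report the call, but only if you run one.

```python
def double(x: int) -> int:
    # The annotation is metadata only; nothing checks it at call time.
    return x * 2


# Passing a str violates the annotation but raises no error at runtime:
# str * 2 is simply string repetition.
result = double("ab")

# The annotations are still there for tools to read.
hints = double.__annotations__
```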
There are many reasons for this. A big one is that many libraries are only partially typed at best, and dynamic types tend to propagate, weakening the guarantees you get from type checking.
Dynamic idioms in general, including something as common as string-indexed dictionaries, negate type checking. Runtime metaprogramming is the same. All of these things have equivalents in a good statically checked language, but Python doesn't follow those models.
Fundamentally, in Python static typing is an optional analysis layer over a dynamic language, and the consequences of that can't be fully mitigated. The result is a big difference in what types can guarantee.
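A small sketch of the propagation problem described above. `load_untyped` is a made-up stand-in for a call into a partially typed library. Because it returns `Any`, a checker has nothing to object to, and the mistake only shows up when the code runs.

```python
from typing import Any


def load_untyped() -> Any:
    """Hypothetical stand-in for a library call without useful type hints."""
    return {"amount": "12.50"}  # a string where a number was expected


def total(records: list[dict[str, Any]]) -> float:
    result = 0.0
    for record in records:
        # record["amount"] is Any, so float + Any passes static checking.
        result += record["amount"]
    return result


try:
    total([load_untyped()])
    error_seen = None
except TypeError as exc:
    # float + str fails only here, at runtime.
    error_seen = type(exc).__name__
```

In a statically checked language the record would normally be a declared type with a numeric field, and the bad value would be rejected where it enters the program rather than where it is first used.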
I have exactly the inverse findings on my end. The bigger and more legacy the codebase, the more accurate the patches become.
The harness itself seems to be the most important part. I use a recursive loop that primes the root context based on the user prompt each time. My agent will often make over 100 tool calls to sql and git before it finally decides to apply a patch. If I was greenfield, there would be nothing to query or constrain against.
I usually find I can achieve 90% of the outcome I'm trying to achieve. I use sonnet for planning, qwen for coding, sonnet for review.
But they're also superhuman in so many other ways. It's valid to point out limitations, but that doesn't support the conclusion that models are not incredibly powerful and capable of the functional equivalent of reasoning at human or superhuman levels in many scenarios.
In our story, investors are mining intelligence from GPUs, and they truly believe they are one inch from discovering the biggest goldmine in history. But GPUs, unlike a goldmine, cannot be inspected for traces of gold by independent contractors. To keep the hype up, the nihilists in our story dig up cheap gold-looking metals from time to time and tell investors that with a bit of alchemy - agentic workflows, etc. - those metals can be magically turned into gold.
Investors will keep digging until the end of the age, or until they run out of money.
gpt 5.2 is the strongest model they tested, and it is nearly 6 months old.
Traditional research cannot keep up.