Constraint Decay: The Fragility of LLM Agents in Back End Code Generation


(arxiv.org)

99 points

by wek 6 hours ago

by jdlshore 4 hours ago

“Our systematic study exposes a phenomenon of constraint decay in LLM-based coding agents. While current models excel at unconstrained generation, their performance drops when forced to navigate explicit architectural rules. For end-users, this dichotomy implies that agents are reliable for rapid prototyping but remain unreliable for production-grade backend development.”

One major weakness of this study is that they didn’t fully test frontier models for cost reasons, so the specific performance results should be taken with a grain of salt. But the overall conclusion that models degrade when both behavior and architecture must be correct is interesting, and something to keep an eye on.

by Animats 2 minutes ago

That may be the same problem seen when prompts try to force "alignment" or "guardrails". There's a performance drop. Seemingly, a big chunk of the potential solution space has been made unreachable.

For example, if you apply "guardrails" to an image generator of about a year ago, all the people start looking alike. Story generators start using only a few standard names.

That was last year. Is it happening with the frontier models?

by qsort 2 hours ago

I think it's downstream of "you can't optimize for two different objectives".

If you only have functional requirements, then in effect you're doing some form of program synthesis, and RL can optimize that very hard.

If you have a mixture of functional and non-functional requirements, you are basically giving the model an incomplete specification, and it must in some way guess at the user's intent to fill in the blanks. This is also why adding to the prompt examples of the style of code you want (hats off to antirez for this particular tip ;)) is phenomenally powerful.

by apsurd 2 hours ago

Would you mind sharing antirez' suggestion?

by qsort 2 hours ago

I am obviously paraphrasing, but the general idea is that trying to synthesize style from a codebase into e.g. a markdown guide generally doesn't work very well. What achieves style transfer is providing the model with a lot of examples of the style, conventions, patterns you want.

To put it in practice: if you point claude/codex to a repository and you ask it to implement feature X using style guide Y, the code will probably work, but you can usually get better results by saying "do it in the style of this file, it was done well there".
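A minimal sketch of the "show examples, not a style guide" idea, assuming you assemble the prompt yourself before handing it to an agent. The function name `build_style_prompt`, the prompt wording, and the sample file are all hypothetical; no particular tool's API is implied.

```python
def build_style_prompt(task: str, exemplars: dict[str, str], max_chars: int = 4000) -> str:
    """Build a prompt that shows exemplar files instead of describing a style in prose.

    `exemplars` maps a file name to its source text (hypothetical inputs).
    Each exemplar is truncated to `max_chars` to keep the prompt bounded.
    """
    parts = [
        f"Task: {task}",
        "Follow the style, conventions, and patterns of these files:",
    ]
    for name, source in exemplars.items():
        parts.append(f"--- {name} ---\n{source[:max_chars]}")
    parts.append("Write the new code so it would pass review next to the files above.")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Hypothetical exemplar; in practice this would be a file you consider well written.
    sample = {"orders.py": "def list_orders():\n    return []\n"}
    print(build_style_prompt("add a refunds endpoint", sample))
```

The point of the design is only that the exemplar source goes into the context verbatim, rather than a summary of it.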

by brandensilva 1 hour ago

Right, more simply put, it's great at being a copycat, exploring similar data points that match your token needs.

It is not great at decision making or judgment calls that don't have a well defined spec or plan in place yet; like unofficial or unapproved tokens if you will. A lot of this stuff simply never has had specs as it has been internal to how companies work and their secret sauce.

The closest thing we have is governance and compliance policies, which legal and business needs require, so they are far better documented than the operational side of how we work. It is more about the how versus the what here, I guess, is what I'm saying.

But yeah this is why it does great when there are tests, design systems, evals, and other artifacts to mirror. Far more reckless and unpredictable without these things, but still great for exploration and finding the data output you seek.

by mikeyouse 1 hour ago

I ran into similar issues as we started to roll out LLM-generated financials in our org. I’m so used to the old SQL workflow of “grab this data from this table, that data from that table, combine it into a final result that looks like xxxx”, where the tables were outputs from reports in our ERP, but I was having terrible results.

Ended up pointing Claude at a few sample files from our existing reporting, gave it read-only OAuth access to the ERP and said “build a new report showing the cash by project as calculated by xxxx - yyyy + zzzz in the style of the existing reports” and it basically one-shot it from there.

Kind of crazy, and I built a bunch of redundant check-sums because I honestly didn’t think it would be able to replace like 6 workdays of effort for the 2 FTEs who generate that kind of thing manually every month, but so far so good.

by BlueTierOps 24 minutes ago

[flagged]

by KaiShips 3 minutes ago

[flagged]

by nijave 2 hours ago

Hmm, I have some anecdotal evidence this is true. While interactively working out a plan with Opus, on multiple occasions it has come up with an incompatible solution; I add additional context/requirements, and it has a tendency to "anchor" on its original architecture and struggles to adapt. Sometimes it tries to sneak in changes for the original plan anyway.

by whstl 1 hour ago

Opus does this waaaay too much for my taste. It works fine for vibe-coders but for technical work it is infuriating.

by UncleEntity 8 minutes ago

I think the problem is they take the shortest path to the goal ...which may or may not coincide with what you have planned. Oh, and they generally think instructions are merely suggestions, and that what you really want is this totally different thing and not the one in the plan you handed them; plus, as a stroke of good luck, this other system is a lot easier to implement as well.

I mean, I spend more tokens having them clean up all the places they didn't follow the plan (if I catch it) or implementing what came out of a 'complete and tested' previous plan where they just stop as soon as all the pathetic new tests pass and you discover half of it isn't even there when trying to implement the next thing on top of it.

Though... I have been conducting an experiment, of sorts, where we've been cooking on these fairly complicated projects and I don't ever touch a single line of code, just yell at them a lot, and with suitable amounts of marijuana (they are very frustrating most of the time) it's been going pretty well. It also helps that they need to explain what they're doing to somebody fairly baked. Maybe not such an HR-friendly plan?

by jeremyjh 2 hours ago

Even the strongest frontier model they used - GPT 5.2 - I would consider barely usable for agentic programming.

I’m not really interested in analysis of the weaknesses of such models because in my experience many weaknesses disappear entirely as models get stronger and reasoning effort is turned up. Especially if you tell them what you want them to do.

Also, it’s not surprising to learn that when more acceptance criteria are added the failure rate increases.

by sigbottle 2 hours ago

Wait isn't gpt 5.2 good? Or is it not thinking / not codex? 5.2 was what sparked the late 2025 openai agentic programming revolution.

by xienze 2 hours ago

> their performance drops when forced to navigate explicit architectural rules

Even the best models have trouble adhering to stuff as mundane as rules for how to style generated code (indent this much, name things with these patterns, etc.). Even the most die-hard AI-first coder will admit to that kind of stuff being not unheard-of. Yet they still delude themselves into thinking that these models will follow a sufficiently detailed spec to the letter, every time.

by vishvananda 1 hour ago

I've been experimenting quite a bit with long-horizon agentic coding[1] and I have also noticed that agents seem to perform worse when forced into certain architectural patterns. I have found that it is a bit better when including the constraints along the way instead of adding them after the fact. There seems to be a side-effect I have been calling "calcification", where a pattern starts appearing in the codebase and the agent follows the pattern to the point where it dominates the context and becomes self-reinforcing. This could potentially be a strength or a weakness for existing code bases, depending on the codebase quality. I will have more insights on this soon as more from-scratch runs conclude that include architectural guidance from the beginning.

[1]: https://medium.com/@vishvananda/i-spent-2-billion-tokens-wri...

by maxbond 4 hours ago

Reminds me of the recent paper about delegating document editing tasks to LLMs across different disciplines [1]. That paper found that programming was the only discipline most LLMs can perform long horizon tasks on without accumulating errors & corrupting the document.

I've only read the abstract of this one so far but it seems like this paper has zoomed in on programming with greater fidelity and shown a similar phenomenon. But not about long horizon tasks, more like "long style horizons" of larger sets of structural constraints.

[1] https://arxiv.org/abs/2604.15597

Discussion: https://news.ycombinator.com/item?id=48073246

by emp17344 3 hours ago

If it’s not easily verifiable, LLMs aren’t good at it.

by jeremyjh 3 hours ago

I think that’s mostly because they get so much more of that reinforcement learning, since it is so economical. I don't know if there is any evidence of a fundamental reason they can’t be just as good at other tasks, but it might be economically infeasible for a while yet.

by mjburgess 2 hours ago

No one is curating vast amounts of data for them in other domains. Programmers send programs with fixes.

by knollimar 1 hour ago

There's no diff of my excel lambdas being fixed? :(

by emp17344 46 minutes ago

RLVR doesn’t work for unverifiable tasks, so they won’t be able to effectively use tools to boost reliability for those tasks.

by dwa3592 3 hours ago

This sounds like another version of "As a chat becomes longer, the guardrails seem to become fuzzy". You can't use all of the context window because by the end the output no longer respects the constraints (or guardrails), but to reliably produce production-grade code you want the model to have expansive awareness, which fills up the context window pretty quickly. It's like saying "Keep everything in mind from these 6 directories - and make this <insert ticket> change" - but keeping everything in mind already fills its context window, which makes it lose its ability to follow the constraints (or guardrails).

by whatever1 3 hours ago

This is not a new problem though. This is why we started writing modular code, strict interfaces, etc.

by lanstin 2 hours ago

And doing incremental dev, so once a feature is done you can mostly ignore it.

by Silhouette 2 hours ago

If there is one good thing that the generative AI tools have shown beyond any doubt it's that the classic "good programming" practices are still useful and effective. Self-documenting code. Modular design. Clearly defined architecture. Incremental development. Coding standards. Automated tests. Automated everything.

If there's a second thing the generative AI tools have shown beyond any doubt it's that many of the more modern (relatively speaking) "best practices" that have always been over-hyped and questionably-evidenced really do tend to produce worse results. LLMs take these methods to their logical conclusions and show us the end result much sooner. You can't just iterate your way to a solution when you don't even know what problem you're trying to solve. If you don't have a clear spec then you don't know what a correct product looks like. You need to invest time in reviewing code properly. If you don't keep the big picture in mind then the big picture becomes a mess.

Maybe one day the LLMs will leave me out of a job but at least I'll feel validated first!

by p0w3n3d 3 hours ago

> tasks spanning eight web frameworks

Does anyone else have the experience that LLMs create better pure HTML+CSS+JS than they do when working with existing frameworks?

by bob1029 2 hours ago

I think web frameworks have been "in trouble" as of gpt-5.4. I can't imagine using something like React anymore.

The most incredible combo I've seen lately is progressive enhancement of Razor Pages with javascript. With this arrangement the newest models tend to make a really good call on if something should happen server-side (cshtml) or on the client (js).

by yomismoaqui 3 hours ago

Also they used languages with dynamic typing like Python & JS. In my experience a statically typed codebase is easier to maintain for humans so maybe it is also for agents.

When using Codex/Claude Code with Go code I cannot count the times the agent makes some change, runs a build to check for errors, finds some, and fixes them.

by acbart 3 hours ago

It's crazy to me that people think of Python as dynamically typed by default. Strong static typing has been an option in Python for years now, and it should just be the default.

by epgui 3 hours ago

The Python type hints are useful for static analysis (and yes, should be the default) but they’re a joke compared to the utility of types in a language like Haskell.

by mrob 1 hour ago

>Strong static typing has been an option in Python for years now, and it should just be the default.

https://docs.python.org/3/library/typing.html

"The Python runtime does not enforce function and variable type annotations. They can be used by third party tools such as type checkers, IDEs, linters, etc."

Which third-party enforcement mechanism do you propose become the default?
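A small sketch of what that quoted sentence means in practice, assuming plain CPython with no type checker in the loop. The function `add` is a made-up example.

```python
def add(a: int, b: int) -> int:
    # The annotations say int, but the interpreter does not check them at call time.
    return a + b


# No TypeError is raised: str + str is valid, so the call simply concatenates.
result = add("foo", "bar")
print(result)  # foobar

# The annotations are only stored as metadata for external tools to read.
print(add.__annotations__)
```

A static checker run separately would report the call with string arguments, but nothing stops the program from running without one.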

by antonvs 31 minutes ago

Typing with tools like Pyright doesn't come close to providing what a good statically typechecked language provides.

There are many reasons for this. A big one is that many libraries are only partially typed at best, and dynamic types tend to propagate, weakening the guarantees you get from type checking.

Dynamic idioms in general, including something as common as string-indexed dictionaries, negate type checking. Runtime metaprogramming is the same. All of these things have equivalents in a good statically checked language, but Python doesn't follow those models.

Fundamentally, in Python static typing is an optional analysis layer over a dynamic language, and the consequences of that can't be fully mitigated. The result is a big difference in what types can guarantee.
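To illustrate the string-indexed dictionary point, here is a hedged sketch contrasting a dataclass with a `dict[str, Any]`. The `User` class and both helper functions are invented for the example; the comments describe how checkers generally treat `Any`, not the output of any specific tool.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class User:
    name: str
    age: int


def age_next_year_typed(user: User) -> int:
    # A checker knows user.age is an int and can flag a misspelled attribute.
    return user.age + 1


def age_next_year_dict(user: dict[str, Any]) -> int:
    # The value type is Any, so any key and any operation is accepted statically.
    return user["age"] + 1


print(age_next_year_typed(User("Ada", 36)))  # 37
print(age_next_year_dict({"age": 36}))       # 37

try:
    # Accepted by static analysis because of Any, but fails when executed.
    age_next_year_dict({"age": "36"})
except TypeError as exc:
    print("runtime failure:", exc)
```

Once a value is `Any`, everything computed from it is unchecked too, which is the propagation effect described above.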

by bob1029 3 hours ago

> Our findings reveal a phenomenon of constraint decay: as structural requirements accumulate, agent performance exhibits a substantial decline.

I have exactly the inverse findings on my end. The bigger and more legacy the codebase, the more accurate the patches become.

The harness itself seems to be the most important part. I use a recursive loop that primes the root context based on the user prompt each time. My agent will often make over 100 tool calls to sql and git before it finally decides to apply a patch. If I was greenfield, there would be nothing to query or constrain against.

by richardlblair 3 hours ago

I find the same. We have abstractions with multiple concrete implementations, examples of patterns and examples of anti patterns.

I usually find I can achieve 90% of the outcome I'm trying to achieve. I use sonnet for planning, qwen for coding, sonnet for review.

by xcjsam 2 hours ago

The harness mattering more than the model lines up with my experience too. What this paper measures is within-turn constraint decay. The version that bites in multi-agent setups is across-session — the architectural rules an agent wrote down on Monday don't reach the agent making the next change on Tuesday.

by gkfasdfasdf 4 hours ago

Odd they used GPT-5.2 and not GPT-5.2-codex, i.e. the one optimized for coding agent tasks.

by maleldil 55 minutes ago

Considering this is from academia, there's a chance there were limitations on the available models. My research group accesses OpenAI models via Azure, and until recently (last week) the latest model was GPT 5. We just got 5.4.

by leecommamichael 3 hours ago

These things don’t think. We’re going to have to reiterate this for a long time, I fear.

by emp17344 3 hours ago

There is now a trillion-dollar industry bent to the task of convincing people these things can think. It’s gonna cause some damage.

by suprfnk 1 hour ago

I don't think they think. I still use them a lot despite that, because they are very powerful parameterised code generators.

by sheeshkebab 3 hours ago

…but they reason well enough given enough context (using their matmuls).

by noosphr 3 hours ago

To this day frontier models think that A and not B means A and B when the sentence gets pushed far enough back in their context window. The context length that a model can reason over without obvious errors is much smaller than the advertised context: between 1/4 and 1/20 of what is advertised on the tin.
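As back-of-the-envelope arithmetic for that rule of thumb, here is a tiny sketch. The 200,000-token window is a hypothetical input and not a figure from the paper or from any vendor.

```python
def effective_context_range(advertised_tokens: int) -> tuple[int, int]:
    """Return (low, high) usable tokens under the 1/20 to 1/4 rule of thumb."""
    return advertised_tokens // 20, advertised_tokens // 4


# Hypothetical advertised window of 200,000 tokens.
low, high = effective_context_range(200_000)
print(low, high)  # 10000 50000
```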

by antonvs 27 minutes ago

Critiques like this tend to focus very hard on what models can't do. It's true, they have limitations.

But they're also superhuman in so many other ways. It's valid to point out limitations, but that doesn't support the conclusion that models are not incredibly powerful and capable of the functional equivalent of reasoning at human or superhuman levels in many scenarios.

by Npovview 2 hours ago

Do you also happen to remember what you ate last Thursday?

by leecommamichael 2 hours ago

Is that the same gap as what you’re responding to? To me, it seems his critique is about advertised capability and logical statements, and your rhetorical(?) question is about memory.

by akomtu 36 minutes ago

There is a movie, Gold (2016), about a fake gold mine. One of its founders is a true believer: he found a few chunks of gold and started digging for more. The other founder is a nihilist: he realised that there is no gold there, but who cares if he makes the investors believe? So he does, and almost sells the company for $300M.

In our story, investors are mining intelligence from GPUs, and they truly believe they are one inch from discovering the biggest goldmine in history. But GPUs, unlike a goldmine, cannot be inspected for traces of gold by independent contractors. To keep the hype up, the nihilists in our story dig up cheap gold-looking metals from time to time and tell investors that with a bit of alchemy - agentic workflows, etc. - those metals can be magically turned into gold.

Investors will keep digging until the end of the age, or until they run out of money.

by delichon 23 minutes ago

This metaphor would have AI producing no real value yet, with value possibly arriving at some point in the future after a dramatic breakthrough, like striking a gold vein. Yet I look around and see gobs of people, including me, using it right now to seriously accelerate cognitive work. We're seeing signs of the economy as a whole reshaping to it. The vision of a few hucksters leading the lemmings doesn't wash; we lemmings are in on it too.

by rbbydotdev 3 hours ago

This is interesting. Anecdotally, in a recent TypeScript project I felt like I was having better luck with raw sqlite queries than with an ORM (drizzle).

by oulipo2 2 hours ago

Exactly why you can't remove humans in the loop to assess that the solution is not only correct (which LLMs are quite bad at, once concurrency, logic, etc are involved), but also elegant, maintainable, etc

by phrotoma 2 hours ago

"constraint decay" isn't this just another name for the (already well understood) idea of "context rot"?

by volume_tech 6 hours ago

[flagged]

by spacedoutman 1 hour ago

This research is useless and nearly all other LLM research is too.

gpt 5.2 is the strongest model they tested, a nearly 6 month old model.

Traditional research cannot keep up.

by acgourley 1 hour ago

I disagree, their findings should generalize to the frontier. Even if the latest can deal with the extra complexity, it stands to reason it will take more tokens to do less. This could be a useful insight into the next generation of evals.
