> Neither do people, yet people manage to write software that they can evolve over a long time

You need a specific methodology to do that, one that separates "programming in the large" (the interaction across program modules) from "programming in the small" within a single, completely surveyable module. In an agentic context, "surveyable" code realistically has to imply a manageable size relative to the agent's context. If the abstraction boundaries across modules leak in a major way (including due to undocumented or casually broken invariants) that's a bit of a disaster, especially wrt. evolvability.

reply
Agents just can't currently do that well. When you run into a problem while evolving the code to add a new feature or fix a bug, you need to decide whether the change belongs in the architecture or should be made locally. Agents are about as good as a random choice at picking the right answer, and there's typically only one right answer. They simply don't have the judgment. Sometimes you get the wrong choice in one session and the right choice in another.

But this happens at all levels, because there are many more than just two abstraction levels. E.g. do I change a subroutine's signature, or do I adapt the callsite? Agents get it wrong. A lot.
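To make the signature-vs-callsite dilemma concrete, here's a minimal hypothetical sketch (the function names and the tax-handling scenario are invented for illustration):

```python
# Hypothetical scenario: a caller now needs tax applied to a total.
# Option A: change the subroutine's signature -- an interface change
# that every existing callsite must then follow.
def total_price(items, tax_rate=0.0):
    return sum(items) * (1 + tax_rate)

# Option B: keep the subroutine untouched and adapt at the callsite --
# a purely local change.
def checkout(items, tax_rate):
    subtotal = total_price(items)       # original signature, unchanged
    return subtotal * (1 + tax_rate)    # tax handled locally instead

print(checkout([10.0, 5.0], 0.5))  # 22.5
```

Which option is right depends on whether taxation is a concern of the pricing module or of this one caller; that's exactly the kind of judgment call where an agent's answer is close to a coin flip.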

Another thing they just don't get (because they're so focused on task success) is that it's very often better to let things go wrong in a way that could inform changes rather than get things to "work" in a way that hides the problem. One of the reasons agent code needs to be reviewed even more carefully than human code is that they're really good at hiding issues with potentially catastrophic consequences.

reply
> Agents are about as good as a random choice in picking the right answer, and there's typically only one right answer.

That's realistically because they aren't even trying to answer that question by reasoning carefully about the code. The limited context they work in leaves them guessing and reaching for the first thing that might work. That's why they generally do a bit better when you explicitly ask them to reverse engineer or document the design of an existing codebase: that's a task that at least carries an explicit requirement to comprehensively survey the code, figure out which parts matter, etc. They can't be expected to do that by default. It's not even a limitation of existing models; it's quite inherent to how they're architected.

reply
Yes, and I think there's a fundamental problem here. The big reason "AI thought leadership" claims that AI should do well at coding is that there are mechanical success metrics like tests. Except that's not true: the tests cover the behaviour, not the structure. It's like constructing a building where the only test is whether the floorplans match the design. That makes catastrophic structural issues easy to hide. The building looks right, and it might even withstand some load, but later, when you want to make changes, you move a cupboard or a curtain rod only to have the structure collapse because that element turned out to be load-bearing.
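A toy illustration of behaviour passing while structure rots (the function, the global, and the test are all hypothetical, invented for this sketch):

```python
# Hypothetical example: a behavioural test that passes even though the
# structure is broken. `format_report` quietly reaches into module-level
# state, making that global "load-bearing" for every future change.
REPORT_HEADER = "Sales"  # hidden coupling: not passed in, just grabbed

def format_report(rows):
    # Behaviourally correct today, but any code that later mutates
    # REPORT_HEADER silently changes unrelated reports.
    return f"{REPORT_HEADER}: {sum(rows)}"

def test_format_report():
    # The test checks only the output, so the coupling is invisible to it.
    assert format_report([1, 2, 3]) == "Sales: 6"

test_format_report()  # passes; the structural flaw stays hidden
```

A green test suite here says nothing about whether the next change will be safe to make.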

It's funny, but one of the lessons I've learnt working with agents is just how much design matters in software, that it isn't just a matter of craftsmanship pride. When you see a codebase implode after the tenth new feature and realise it has to be scrapped because neither human nor AI can salvage it, the importance of design becomes palpable. Before agents it was hard to see, because few people write code like that (just as no one would think to make a curtain rod load-bearing when constructing a building).

And let's not forget that the models hallucinate. Just now I was discussing architecture with Codex, and what it says sounds plausible, but it's wrong in subtle and important ways.

reply
> The big reason "AI thought leadership" claims that AI should do well at coding is that there are mechanical success metrics like tests.

I mean, if you properly define "do well" as getting a first draft of something interesting that might or might not be a step towards a solution, that's not completely wrong. A pass/fail test is verified feedback of a sort, that the AI can then do quick iteration on. It's just very wrong to expect that you can get away with only checking for passing tests and not even loosely survey what the AI generated (which is invariably what people do when they submit a bunch of vibe-coded pull requests that are 10k lines each or more, and call that a "gain" in productivity).

reply
It's not completely wrong if you're interested in a throwaway codebase. It is completely wrong if what you want is a codebase you'll evolve over years. Agents are nowhere close to offering that (yet) unless a human is watching them like a hawk (closer than you'd watch another human programmer, because human programmers don't make such dangerous mistakes as frequently, and when they do make them, they don't hide them as well).
reply