upvote
Debugging would suffer as well, I assume. There's this old adage that if you write the cleverest code you can, you won't be clever enough to debug it.

There's nothing really stopping agents from writing the cleverest code they can. So my question is, when production goes down, who's debugging it? You don't have 10 days.

reply
> the art is load-bearing

This is beautiful

reply
Oh my god have Anthropic products been absolutely saying everything is load bearing for the last week or so. Literally ever other paragraph has “such and such is load-bearing”.

Funny enough, today it seems to have stopped…

reply
They can work really well if you put sufficient upfront engineering into your architecture and it's guardrails, such that agents (nor humans) basically can't produce incorrect code in the codebase. If you just let them rip without that, then they require very heavy baby-sitting. With that, they're a serious force-multiplier.
reply
They don't work really well even on relatively small things and even with a virtually impractical upfront engineering: https://news.ycombinator.com/item?id=47752626

They just make a lot of mistakes that compound and they don't identify. They currently need to be very closely supervised if you want the codebase to continue to evolve for any significant amount of time. They do work well when you detect their mistakes and tell them to revert.

reply
We've done the impractical upfront engineering, and they're working well for us :)
reply
The problem is, the MBAs running the ship are convinced AI will solve all that with more datacenters. The fact that they talk about gigawatts of compute tells you how delusional they are. Further, the collateral damage this delusion will occur as these models sigmoid their way into agents, and harnesses and expert models and fine tuned derivatives, and cascading manifold intelligent word salad excercises shouldn't be under concerned.
reply
Why worry?
reply
A lot of that can be overcome by including the need to be able to put more floors on top as part of the spec. Whether it be humans or agents, people rarely specify that one explicitly but treat it as an assumed bit of knowledge.

It goes the other way quite often with people. How often do you see K8s for small projects?

reply
> A lot of that can be overcome by including the need to be able to put more floors on top as part of the spec

I wish it could, but in practice, today's agents just can't do that. About once a week I reach some architectural bifurcation where one path is stable and the other leads to an inevitable total-loss catastrophe from which the codebase will not recover. The agent's success rate (I mostly use Codex with gpt5.4) is about 50-50. No matter what you explain to them, they just make catastrophic mistakes far too often.

reply
Something is missing in the common test suite if this can occur, right?
reply
First, it's not "can occur" but does occur 100% of the time. Second, sure, it does mean something is missing, but how do you test for "this codebase can withstand at least two years of evolution"?
reply
You can spend a lot of time perfecting the test suite to meet your specific requirements and needs, but I think that would take quite a while, and at that point, why not just write the code yourself? I think the most viable approach of today's AI is still to let it code and steer it when it makes a decision you don't like, as it goes along.
reply
You have to fight to get agents to write tests in my experience. It can be done, but they don't. I've yet to figure out how get any any agent to use TDD - that is write a test and then verify it fails - once in a while I can get it to write one test that way, but it then writes far more code to make it pass than the test justifies and so is still missing coverage of important edge cases.
reply
Instead of fighting agents to write tests, what if the testing agent is the product itself? That's the idea behind Autonoma (https://github.com/autonoma-ai/autonoma), AI agents that do E2E testing by exploring your app like a real user.
reply
I have TDD flow working as a part of my tasks structuring and then task completion. There are separate tasks for making the tests and for implementing. The agent which implements is told to pick up only the first available task, which will be “write tests task”, it reliably does so. I just needed to add how it should mark tests as skipped because it’s been conflicting with quality gates.
reply
Maybe, and that unknown unknown is the biggest footgun with LLMs in just about every regard.
reply
This just sounds like incomplete specs to me. And poor testing.
reply
It isn't. Anthropic tried building a fairly simple piece of software (a C compiler) with a full spec, thousands of human-written tests, and a reference implementation - all of which were made available to the agent and the model trained on. It's hard to imagine a better tested, better-specified project, and we're talking about 20KLOC. Their agents worked for two weeks and produced a 100KLOC codebase that was unsalvageable - any fix to one thing broke another [1]. Again, their attempt was to write software that's smaller, better tested, and better specified than virtually any piece of real software and the agents still failed.

Today's agents are simply not capable enough to write evolvable software without close supervision to save them from the catastrophic mistakes they make on their own with alarming frequency.

Specifically, if you look at agent-generated code, it is typically highly defensive, even against bugs in its own code. It establishes an invariant and then writes a contingency in case the invariant doesn't hold. I once asked it to maintain some data structure so that it could avoid a costly loop. It did, but in the same round it added a contingency (that uses the expensive loop) in the code that consumes the data structure in case it maintained it incorrectly.

This makes it very hard for both humans and the agent to find later bugs and know what the invariants are. How do you test for that? You may think you can spec against that, but you can't, because these are code-level invariants, not behavioural invariants. The best you can do is ask the agent to document every code-level invariant it establishes and rely on it. That can work for a while, but after some time there's just too much, and the agent starts ignoring the instructions.

I think that people who believe that agents produce fine-but-messy code without close supervision either don't carefully review the code or abandon the project before it collapses. There's no way people who use agents a lot and supervise them closely believe they can just work on their own.

[1]: https://www.anthropic.com/engineering/building-c-compiler

reply
"Incomplete specs" is the way of the world. Even highly engineered projects like buildings have "incomplete specs" because the world is unpredictable and you simply cannot anticipate everything that might come up.
reply
A sufficiently complete spec is indistinguishable from source code.
reply
And sometimes it can't even handle it then. I was recently porting ruby web code to python. Agents were simultaneously surprisingly good (converting ActiveRecord to sqlalchemy ORM) and shockingly, incapably bad.

For example, ruby uses blocks a lot. Ruby blocks are curious little thingies because they are arguably just syntax sugar for a HOF, but man it's great syntax sugar. Python then has "yield" which is simultaneously the same keyword ruby uses for blocks, but works fundamentally differently (instead of just a HOF, it's for generating an iterator/generator) and while there are some decorators that can use yield's ability to "pause" execution in the function to send control flow back out of the function for a moment (@contextmanager) which feels _even more_ like ruby blocks, it's a rather limited trick and requires the decorator to adapt the Generator to a context manager and there's just no good way to generalize that.

Somehow this is the perfect storm to make LLMs completely incapable of converting ruby code that uses blocks for more than the basic iteration used in the stdlib. It will try to port to python code that is either nonsensical, or uses yield incorrectly and doesn't actually work (and in a way that type checkers can even spot). And furthermore, even if you can technically whack it with a hammer until it works with yield, it's often not at all the way to do it. Ruby devs use blocks not-uncommonly while python devs are not really going to be using yield often at all, perhaps outside of @contextmanager. So the right move is usually to just restructure control flow to not need to use blocks/HOFs (or double down and explicitly pass in a function). (Rubyists will cringe at this, and rightly so... Ruby is often extraordinarily expressive).

The fact that such a simple language feature trips them up so completely is pretty odd to me. I guess maybe their training data doesn't include a lot of ruby-to-python conversions. Maybe that's indicative of something, but I digress.

reply
The trick is to generate a spec from the old code, then generate the new codebase from the spec.
reply
Lol I largely agree with my beloved dissenters, just not on the same magnitude. I understand complete specs are impossible and equivalent to source code via declaration. My disagreement is with this particular part:

"t's more like asking the agent to build a building from floorplans and spec, and it produces everything in the right measurements and right colours and passes all tests. Except then you find out that the walls and beams are made of foam and the art is load-bearing. "

If your test/design of a BUILDING doesn't include at simulations/approximations of such easy to catch structural flaws, its just bad engineering. Which rhymes a lot with the people that hate AI. By and large, they just don't use it well.

reply
We call the complete specs "source code".
reply
... and it still doesn't work. In the Anthropic experiement, the model was trained on a reference implementation and the agents still failed.
reply