undefined

upvote

points

by slaye9 hours ago |

upvote

by simonw9 hours ago|

[-]

This is genuinely one of the most interesting questions right now. I don't have solid answers yet, and I'm very keen to learn what people are finding works.

If you accelerate the pace of code creation it inevitably creates bottlenecks elsewhere. Code review is by far the biggest of those right now.

There may be an argument for leaning less on code review. When code is expensive to produce and is likely to stay in production for many years it's obviously important to review it very carefully. If code is cheap and can be inexpensively replaced maybe we can lower our review standards?

But I don't want to lower my standards! I want the code I'm producing with coding agents to be better than the code I would produce without them.

There are some aspects of code review that you cannot skimp on. Things like coding standards may not matter as much, but security review will never be optional.

I've recently been wondering what we can learn from security teams at large companies. Once you have dozens or hundreds of teams shipping features at the same time - teams with varying levels of experience - you can no longer trust those teams not to make mistakes. I expect that the same strategies used by security teams at Facebook/Google-scale organizations could now be relevant to smaller organizations where coding agents are responsible for increasing amounts of code.

Generally though I think this is very much an unsolved problem. I hope to document the effective patterns for this as they emerge.

reply

upvote

by malexw8 hours ago|

[-]

I think Martin Fowler's "Refactoring" might give a bit of insight here. One of my take-aways after reading that book is that the specific implementation of a function is not very important if you have tests. He argues that it can sometimes be easier to completely re-write a function than to take the time to understand it - as long as you can validate that your re-write performs the same way. This mindset lines up pretty closely with how I've been using LLMs.

If that's true, then I would think the emphasis in code review should be more on test quality and verifying that the spec is captured accurately, and as you suggest, the actual implementation is less important.

reply

upvote

by DanHulton6 hours ago|

[-]

This is why I've been pushing back on the "just have the AI generate the tests!" mentality. Sure, let it help you, but those tests are the guarantee of quality and fit for purpose. If you vibe code them, how the hell do you know if it even does what you think it does?

You should be planning out the tests to properly exercise the spec, and ensuring those tests actually do what the spec requires. AI can suggest more tests (but be careful here, too, because a ballooned test suite slows down CICD), but it should never be in charge of them completely.

reply

upvote

by MattGrommes2 hours ago|

[-]

A related book I've been thinking about in terms of LLMs is "Working Effectively With Legacy Code". I'd love to be able to work a lot of that advice into some kind of Skill or customized agent to help with big refactors.

reply

upvote

by malexw2 hours ago|

[-]

Oh gosh - now that you mention it, it was "Working Effectively with Legacy Code" that I was thinking of, not "Refactoring".

reply

upvote

by fhd25 hours ago|

[-]

That's my experience with agentic development so far, a lot of extra time goes into testing.

Problem is, the way I've been trained to test isn't exactly antagonistic. QA does that kind of thing. Programmers writing tests are generally rather doing spot checks that only make sense if the code is generally understood and trustworthy. Code LLMs produce is usually broken in subtle, hard to spot ways.

reply

upvote

by ep1038 hours ago|

[-]

Counter-point, developers that get used to not caring about function implementation, are going to culturally also not care as much about test implementation, making this proposed ideal impossible.

reply

upvote

by lmf4lol6 hours ago|

[-]

with LLMs, tests cost nearly nothing of effort but provide tremendous value.

reply

upvote

by contagiousflow5 hours ago|

[-]

And you know those tests are correct how?

reply

upvote

by simonw4 hours ago|

[-]

Look at what they are testing.

reply

upvote

by cma2568 hours ago|

[-]

> There may be an argument for leaning less on code review. When code is expensive to produce and is likely to stay in production for many years it's obviously important to review it very carefully. If code is cheap and can be inexpensively replaced maybe we can lower our review standards?

Agree with everything else you said except this. In my opinion, this assumes code becomes more like a consumable as code-production costs reduce. But I don't think that's the case. Incorrect, but not visibly incorrect, code will sit in place for years.

reply

upvote

by simonw8 hours ago|

[-]

> Agree with everything else you said except this.

Yeah, I'm not sure I agree with what I said there myself!

> Incorrect, but not visibly incorrect, code will sit in place for years.

If you let incorrect code sit in place for years I think that suggests a gap in your wider process somewhere.

I'm still trying to figure out what closing those gaps looks like.

The StrongDM pattern is interesting - having an ongoing swarm of testing agents which hammer away at a staging cluster trying different things and noting stuff that breaks. Effectively an agent-driven QA team.

I'm not going to add that to the guide until I've heard it working for other teams and experienced it myself though!

reply

upvote

by Balgair6 hours ago|

[-]

This kinda gets into the idea of AIs as droids right?

So, you have a code writing droid that is aligned towards writing good clean code that humans can read. Then you have an implementation droid that goes into actually launching and running the code and is aligned with business needs and expenses. And you have a QA droid that stress tests the code and is aligned with the hacker mindset and is just slightly evil, so to speak.

Each droid is working together to make good code, but also are independent and adversarial in the day to day.

reply

upvote

by aprdm3 hours ago|

[-]

These are just agents with a different name ? People have been working like that today.

reply

upvote

by pixl972 hours ago|

[-]

Theoretically I'd want a totally different model cross checking the work at some point, since much like an individual may have blind spots, so will a model.

reply

upvote

by esafak4 hours ago|

[-]

It assumes that bugs are rare and easy to fix. A look at Claude Code's issue tracker (https://github.com/anthropics/claude-code/issues) tells you that this is not so. Your product could be perpetually broken, lurching from one vibe coded bug to another.

reply

upvote

by atomicUpdate5 hours ago|

[-]

> When code is expensive to produce and is likely to stay in production for many years it's obviously important to review it very carefully. If code is cheap and can be inexpensively replaced maybe we can lower our review standards?

I don't care how cheap it is to replace the incorrect code when it's modifying my bank account or keeping my lights on.

reply

upvote

by pixl972 hours ago|

[-]

Oh, don't worry, even before AI the companies in question were already outsourcing a lot of this to the cheapest companies they could find. We are just very very lucky most of the problems incurred get caught before being foisted on the wider world.

reply

upvote

by benmathes4 hours ago|

[-]

One model I've seen is moving the review stage to the designs, not the code itself.

I.e. have a `planning/designs/unbuilt/...` folder that contains markdown descriptions of features/changes that would have gotten a PR. Now do the review at the design level.

reply

upvote

by keithnz5 hours ago|

[-]

Agent based code reviews is what you want. But you have to do set it up with really good context about what is wanted. You then review the reviews, keep improving the context it is working with. Make sure it's put into everyone's global context they work with as well.

Weirdly this article doesn't really talk about the main agentic pattern

- Plan (really important to start with a plan before code changes). iteratively build a plan to implement something. You can also have a colelctive review of the plan, make sure its what you want and there is guidance about how it should implement in terms of architecture (should also be pulling on pre existing context about your architecure /ccoding standards), what testing should be built. Make sure the agent reviews the plan, ask the agent to make suggestions and ask questions

- Execute. Make the agent (or multiple agents) execute on the plan

- Test / Fix cycle

- Code Review / Refactor

- Generate Test Guidance for QA

Then your deliverables are Code / Feature context documentation / Test Guidance + evolving your global/project context

reply

upvote

by simonw4 hours ago|

[-]

I'm still trying to figure out how to write about planning.

The problem is Claude Code has a planning mode baked in, which works really well but is quite custom to how Claude Code likes to do things.

When I describe it as a pattern I want to stretch a little beyond the current default implementation in one of the most popular coding agents.

reply

upvote

by jgraettinger13 hours ago|

[-]

Maintaining a high-quality requirements / specification document for large features prior to implementation, and then referencing it in "plan mode" prompts, feels like consensus best practice at this stage.

However a thing I'm finding quite valuable in my own workflows, but haven't seen much discussion of, is spending meaningful time with AI doing meta-planning of that document. For example, I'll spend many sessions partnered with AI just iterating on the draft document, asking it to think through details, play contrarian, surface alternatives, poke holes, identify points of confusion, etc. It's been so helpful for rapidly exploring a design space, and I frequently find it makes suggestions that are genuinely surprising or change my perspective about what we should build.

I feel like I know we're "done" when I thoroughly understand it, a fresh AI instance seems to really understand it (as evaluated by interrogating it), and neither of us can find anything meaningful to improve. At that point we move to implementation, and the actual code writing falls out pretty seamlessly. Plus, there's a high quality requirements document as a long-lived artifact.

Obviously this is a heavyweight process, but is suited for my domain and work.

ETA one additional practice: if the agent gets confused during implementation or otherwise, I find it's almost always due to a latent confusion about the requirements. Ask the agent why it did a thing, figure out how to clarify in the requirements, and try again from the top rather than putting effort into steering the current session.

reply

upvote

by ramoz3 hours ago|

[-]

> consensus best practice

I'm not sure I agree with this. I don't think there needs to be a whole spec & documentation process before plan mode.

There is alternative thought leadership that the waterfall approach for building out projects is not the right Agentic pattern[1].

Planning itself can be such an intensive process where you're designing and figuring out the specs on the fly in a focused manner for the thing the agent will actually develop next. Not sure how useful it is to go beyond this in terms of specs that live outside of the Agentic loop for what should be developed now and next.

I've evolved my own process, originally from plain Claude Code to Claude Code with heavy spec integrated capabilities. However, that became a burden for me: a lot of contextual drift in those documents and then self managing & orchestrating of Claude Code over those documents. I've since reoriented myself to base Claude Code with a fairly high-level effort specific to ad-hoc planning sessions. Sometimes the plans will revolve around specific GitHub issues or feature requests in the ticketing system, but that's about it.

[1] https://boristane.com/blog/the-software-development-lifecycl...

reply

upvote

by jgraettinger12 hours ago|

[-]

I think there's a Danger Zone when planning is light-weight and iterative, and code is cheap, but reviewing code is expensive: it leads to a kind of local hill-climbing.

Suppose you iterate through many sessions of lightweight planning, implementation, and code review. It _feels_ high velocity, you're cranking through the feature, but you've also invested a lot of your time and energy (planning isn't free, and code review and fit-for-purpose checks, in particular, are expensive). As often happens -- with or without AI -- you get towards the end and realize: there might have been a fundamentally better approach to take.

The tradeoff of that apparent velocity is that _now_ course correction is much more challenging. Those ephemeral plans are now gone. The effort you put into providing context within those plans is gone. You have an locally optimal solution, but you don't have a great way of expressing how to start over from scratch pointed in a slightly different direction.

I think that part can be really valuable, because given a sufficiently specific arrow, the AI can just rip.

Whether it's worth the effort, I suppose, depends on how high-conviction you are on your original chosen approach.

reply

upvote

by ramoz2 hours ago|

[-]

nice thoughts

reply

upvote

by eterps4 hours ago|

[-]

You could have a look at: https://github.com/jurriaan/aico

It does 2 things that are very important, 1: reviewing should not be done last, but during the process and 2: plans should result into verifyable specs, preferably in a natural language so you can avoid locking yourself into specific implementation details (the "how") too early.

reply

upvote

by ramoz5 hours ago|

[-]

> what testing should be built

Yea, a big part of my planning has included what verification steps will be necessary along the way or at the end. No plan gets executed without that and I often ask for specific focus on this aspect in plan mode.

reply

upvote

by keithnz5 hours ago|

[-]

yeah, spending a bunch of time with the plan is really worthwhile, nearly all aspects of the plan are worth a bunch of attention. Getting it to think about edge cases and all the scenarios for testing is really worthwhile, what can be automated, what manual testing should be done. It's often working through testing scenarios that I often see gaps in the plan.

reply

upvote

by pc865 hours ago|

[-]

"It works, why change it?" is a horrible attitude but is an organizational and interpersonal problem, not a technical one. They're only 1/3 of the way done according to Kent Beck.¹

There are plenty of orgs using AI who still care about architecture and having easily human-readable, human-maintainable code. Maybe that's becoming an anachronism, and those firms will go the way of the Brontosaurus. Maybe it will be a competitive advantage. Who knows?

¹ "Make it work, make it right, make it fast."

reply

upvote

by manquer5 hours ago|

[-]

The way I am handing this - investing heavily in static and dynamic analysis aspects.

- A lot more linting rules than ever before, also custom rule sets that do more org and project level validations.

- Harder types enforcement in type optional languages , Stronger and deeper typing in all of them .

- beyond unit tests - test quality coverage tooling like mutation testing(stryker) and property based testing (quickcheck) if you can go that precise

- much more dx scripts and build harnesses that are specific to org and repo practices that usually junior/new devs learn over time

- On dynamic side , per pull requests environments with e2e tests that agents can validate against and iterate when things don’t work.

- documentation generation and skill curation. After doing a batch of pull requests reviews I will spend time in seeing where the gaps are in repo skills and agents.

All this becomes pre-commit heavy, and laptops cannot keep up in monorepos, so we ended up doing more remote containers on beefy machines and investing and also task caching (nx/turborepo have this )

Reviews (agentic or human) have their uses , but doing this with reviews is just high latency, inefficient and also tends to miss things and we become the bottleneck.

Earlier the coder(human or agent) gets repeatable consistent feedback it is better

reply

upvote

by layer83 hours ago|

[-]

> I'm met with much pushback because "it works, why change it"?

This is an educational problem, and is unlikely to be easy to fix in your team (though I might be wrong). I would suggest to change to a team or company with a culture that values being able to reason about one’s software.

reply

upvote

by TeeWEE4 hours ago|

[-]

We make the creator of the PR responsible for the code. Meaning they must understand it.

Also, we only allow engineers to commit (agent generated) code. Designers just come up with suggestions, engineers take it and ensure it fits our architecture.

We do have a huge codebase. We are teaching Claude Code with CLAUDE.md's and now also <feature>.spec.md (often a summary of the implementation plan).

In the end, engineers are responsible.

reply

upvote

by yonaguska8 hours ago|

[-]

Can you document the hard architectural requirements of your codebase? And keep it up to date? If you can do that, you can force your coworkers to always use those requirements during their prompting /planning for their implementations and you can feed that to an agent and have that review the code.

But more proactively, if people aren't going to write their own code, I think there needs to be a review process around their prompts, before they generate any code at all. Make this a formal process, generate the task list you're going to feed to your LLM, write a spec, and that should be reviewed. This is not a substitute for code reviews, but it tends to ensure that there are only nitpick issues left, not major violations of how the system is intended to be architected.

reply

upvote

by ornornor8 hours ago|

[-]

I’m running into this problem as well with juniors slinging code that takes me a very long time to understand and review. I’m iterating on an AGENTS.md file to share with them because they aren’t going to stop using AI and I’m a little tied of always saying the same things (Claude loves to mock everything and assert that spies were called X times with Y arguments which is a great recipe for brittle tests, for example)

I know they won’t stop using AI so giving them a directives file that I’ve tried out might at least increase the quality of the output and lower my reviewing burden.

Open to other ideas too :)

reply

upvote

by esafak4 hours ago|

[-]

Have an AI reviewer take the first crack at it after pointing it to your rules file (e.g., AGENTS) so you don't have to repeat yourself. Gemini does this fairly well, for example. https://developers.google.com/gemini-code-assist/docs/review...

reply

upvote

by jf228 hours ago|

[-]

How are the architecture changes you are proposing improving the end result?

>but not being able to control the exact output worries me

Why?

reply

upvote

by esafak8 hours ago|

[-]

Code review should be mandatory and reviewers should ask big PRs to be broken up, and its submitters to be able to defend every line of code. For when the computer is generating the code, the most important duty of the submitter is to vouch for it. To do otherwise creates the bad incentive of making others do all your QA, and nobody is going to be rewarded for that.

reply

upvote

by simonw4 hours ago|

[-]

I just added a chapter which touches on that: https://simonwillison.net/guides/agentic-engineering-pattern...

reply

upvote

by simonw8 hours ago|

[-]

Yeah, I think that's one of the biggest anti-patterns right now: dumping thousands of lines of agent-generated code on your team to review, which effectively delegates the real work to other people.

reply

upvote

by xXSLAYERXx5 hours ago|

[-]

> Code review should be mandatory and reviewers should ask big PRs to be broken up

Always, even before all this madness. It sounds more like a function of these teams CR process rather than agents writing the code. Sometimes super large prs are necessary and I've always requested a 30 minute meeting to discuss.

I don't see this as an issue, just noise. Reduce the PR footprint. If not possible meet with the engineer(s)

reply

upvote

by fantasizr8 hours ago|

[-]

Code review is now a bit like Brandolini's law: "The amount of energy needed to refute bullshit is an order of magnitude bigger than that needed to produce it." You ultimately need a lot of buy in to spend more than 5 mins on something that took 5 seconds to produce.

reply