Then, in your prompt, you describe the task you want and add something like: "Supervise the implementation with a sub-agent that follows the architecture skill. Evaluate any proposed changes."
There are people who take this to its limit, and that's how you get things like agent teams. You make agents for planning, design, QA, product, engineering, review, release management, and so on, and you get them to operate and coordinate to produce an outcome.
That's what this is supposed to be, encoded as a feature instead of a best practice.
So the LLM will do something and completely fail to notice that it did it badly. But the same LLM, asked to review the result against the original requirement, will catch the problem almost every time.
The missing thing in these tools is that automatic feedback loop between the two LLMs: one in review mode, one in implementation mode.
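A minimal sketch of what such a loop could look like; `ask_llm` is a hypothetical stand-in for whatever model client you actually use, not a real API:

```python
def ask_llm(prompt: str) -> str:
    # Hypothetical placeholder: plug in your actual model client here.
    raise NotImplementedError

def implement_with_review(requirement: str, max_rounds: int = 3) -> str:
    # One LLM call in implementation mode produces a draft.
    draft = ask_llm(f"Implement the following requirement:\n{requirement}")
    for _ in range(max_rounds):
        # A second call in review mode checks the draft against the original requirement.
        review = ask_llm(
            "Review the work below strictly against the original requirement.\n"
            f"Requirement:\n{requirement}\n\nWork:\n{draft}\n"
            "Reply APPROVED if it satisfies the requirement, otherwise list the problems."
        )
        if review.strip().startswith("APPROVED"):
            break
        # Feed the review back into implementation mode and revise.
        draft = ask_llm(
            "Revise the work to address this review.\n"
            f"Requirement:\n{requirement}\n\nWork:\n{draft}\n\nReview:\n{review}"
        )
    return draft
```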
This sounds more like an automation of that idea than just N-times the work.
Just ask Claude to write a plan and review/edit it yourself. Add success criteria/tests for better results.
```
Rules:
- Only one disk can be moved at a time.
- Only the top disk from any stack can be moved.
- A larger disk may not be placed on top of a smaller disk.
For all moves, follow the standard Tower of Hanoi procedure: If the previous move did not move disk 1, move disk 1 clockwise one peg (0 -> 1 -> 2 -> 0).
If the previous move did move disk 1, make the only legal move that does not involve moving disk 1.
Use these clear steps to find the next move given the previous move and current state.
Previous move: {previous_move} Current State: {current_state} Based on the previous move and current state, find the single next move that follows the procedure and the resulting next state.
```
This is buried down in the appendix, while the main paper is full of agentic swarms this, millions of agents that, and plenty of fancy math symbols and graphs. Maybe there is more to it, but the fact that they decided to publish with such a trivial task, one that could be accomplished far more easily by having an LLM write a simple Python script, is concerning.
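For comparison, here is roughly what that script looks like, directly transcribing the quoted procedure; the peg numbering and list-based state representation are my assumptions:

```python
# pegs[i] is a list of disks on peg i, bottom to top; smaller number = smaller disk.
def next_move(pegs, previous_move):
    # Rule 1: if the previous move did not move disk 1, move disk 1 clockwise (0 -> 1 -> 2 -> 0).
    if previous_move is None or previous_move[0] != 1:
        src = next(i for i, peg in enumerate(pegs) if peg and peg[-1] == 1)
        return (1, src, (src + 1) % 3)
    # Rule 2: otherwise make the only legal move that does not involve disk 1.
    for src in range(3):
        for dst in range(3):
            if src == dst or not pegs[src]:
                continue
            disk = pegs[src][-1]
            if disk != 1 and (not pegs[dst] or pegs[dst][-1] > disk):
                return (disk, src, dst)
    return None  # no legal move left: the puzzle is solved

def apply_move(pegs, move):
    disk, src, dst = move
    pegs[dst].append(pegs[src].pop())

# Example: 3 disks starting on peg 0; stop once the full tower sits on another peg.
pegs = [[3, 2, 1], [], []]
move = None
while max(len(pegs[1]), len(pegs[2])) < 3:
    move = next_move(pegs, move)
    apply_move(pegs, move)
    print(move, pegs)
```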
this does eat up tokens _very_ quickly though :(
You run out of context so quickly, and if you don't have some kind of persistent guidance, things go south.
This would also be true of junior engineers. Do you find them impossible to work with as well?