> But even better, tell it to create subagents to form red team, green team and refactor team while the main instance coordinates them, respecting the clean-room rules. It really works.

It helps, but it definitely doesn't always work, particularly as refactors go on and tests have to change. Useless tests grow in number, and important new things aren't tested, or aren't tested well.

I've had both Opus 4.6 and Codex 5.3 recently tell me the other (or another instance) did a great job with test coverage and depth, only to find tests that merely asserted the test harness had been set up correctly, and functionality that used to be tested now only checked for existence, with its behavior left virtually untested.

Reward hacking is very real and hard to guard against.

reply
The trick is, with the setup I mentioned, you change the rewards.

The concept is:

Red Team (Test Writers): write tests without seeing the implementation. They define what the code should do based on specs/requirements only. Rewarded by test failures. A new test that passes immediately is suspicious, as it means either the implementation already covers it (diminishing returns) or the test is tautological. Red's ideal outcome is a well-named test that fails, because that represents a gap between spec and implementation that didn't previously have a tripwire. Their proxy metric is "number of meaningful new failures introduced," and the barrier prevents them from writing tests pre-adapted to pass.

Green Team (Implementers): write the implementation to pass tests without seeing the test code directly. They only see test results (pass/fail) and the spec. Rewarded by turning red tests green. Straightforward, but the barrier makes the reward structure honest. Without it, Green could satisfy the reward trivially by reading assertions and hard-coding. With it, Green has to actually close the gap between spec intent and code behavior, using error messages as a noisy gradient signal rather than exact targets. Their reward is "tests that were failing now pass," and the only reliable strategy to get there is faithful implementation.

Refactor Team: improve code quality without changing behavior. They can see the implementation but are constrained by tests passing. Rewarded by nothing changing (pretty unusual in this regard): all tests stay green while code quality metrics improve. They're optimizing a secondary objective (readability, simplicity, modularity, etc.) under a hard constraint (behavioral equivalence). The spec barrier ensures they can't redefine "improvement" to include feature work. If you have any code quality tools, it makes sense to give this team the skills needed to use them.

It's worth being honest about the limits. The spec itself is a shared artifact visible to both Red and Green, so if the spec is vague, both agents might converge on the same wrong interpretation, and the tests will pass for the wrong reason. The Coordinator (your main claude/codex/whatever instance) mitigates this by watching for suspiciously easy green passes (just tell it) and probing the spec for ambiguity, but it's not a complete defense.
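
For what it's worth, the visibility barriers can be mocked up with nothing fancier than separate working directories. A minimal sketch, where every file and directory name is invented for illustration:

```shell
set -e
# Toy repo layout (all names here are assumptions for illustration)
mkdir -p repo/src repo/tests rooms/red rooms/green
echo "slugify(text) lowercases and replaces spaces with dashes" > repo/spec.md
printf 'def slugify(t):\n    return t.lower().replace(" ", "-")\n' > repo/src/slug.py
printf 'from slug import slugify\n' > repo/tests/test_slug.py

# Red room: spec only -- Red writes tests without ever seeing src/
cp repo/spec.md rooms/red/

# Green room: spec plus implementation, but never the test code itself;
# the coordinator copies pass/fail output in, nothing else
cp repo/spec.md rooms/green/
cp -r repo/src rooms/green/src

ls rooms/red rooms/green
```

Each team's agent gets dispatched with its room as the working directory, so the barrier is structural rather than a politely worded instruction.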

reply
You guys are describing wonderful things, but I've yet to see any implementation. I tried coding my own agents, yet the results were disappointing.

What kind of setup do you use? Can you share? How much does it cost?

reply
We have a very uncomplicated setup with claude code. A CLAUDE.md with instructions and notes about the repo and how to run stuff. We also do code reviews with Claude Code, but in a separate session.

It works wonderfully well. Costs about $200 USD per developer per month as of now.

reply
If you are not spending 5-10k dollars a month for interesting projects, you likely won't see interesting results
reply
I can't really tell if this is sarcasm or not.
reply
rlm-workflow does all that TDD for you: https://skills.sh/doubleuuser/rlm-workflow/rlm-workflow

(I built it)

reply
Why make powershell a requirement? I like powershell, but Python is very common and already installed on many dev systems.
reply
Thanks for sharing. What does RLM stand for? Any idea why the socket security test fails?
reply
Check out Mike Pocock's work; he's written excellently about red-green-refactor and has a GitHub repo for his skills. Read and take what you need from his TDD skill and incorporate it into your own TDD skill tailored for your project.
reply
This is just AI slop. Building out over-engineered harnesses for agents flies in the face of what the actual designers of Claude/GPT tell you.
reply
I agree with this. There is not a lot of harnesses/wrapping needed for Claude Code.
reply
You don't need a harness beyond Claude Code, but honestly it's foolish to think you shouldn't be building out extra skills to help your workflow. A TDD skill that does red-green-refactoring is using Claude Code exactly how it's meant to be used. They pioneered skills.
reply
Works better than standard claude / gpt, which doesn't do red-green-refactor. Doesn't seem like slop when it meaningfully changes the results for the better, consistently. Really is a game-changer. You should consider trying it.
reply
This is very interesting, but like sibling comments, I'm very curious as to how you run this in practice. Do you just tell Claude/Copilot to do what you describe?

And do you have any prompts to share?

reply
You don't need most of this. Prompts are also normally what you would say to another engineer.

* There is a lot of duplication between A & B. Refactor this.

* Look at ticket X and give me a root cause

* Add support for three new types of credentials - Basic Auth, Bearer Token and OAuth Client Creds

Claude.md has stuff like "Here's how you run the frontend. Here's how you run the backend. This module supports the frontend. That module is batch jobs. Always start commit messages with the ticket number. Always run compile at the top level. When you make code changes, always add tests," etc.
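
A minimal CLAUDE.md in that spirit might look like this (the paths and commands here are invented for illustration):

```markdown
# CLAUDE.md

## Running things
- Frontend: `npm run dev` in `web/`
- Backend: `make run` in `server/`

## Layout
- `web/` supports the frontend; `jobs/` is batch jobs.

## Rules
- Start commit messages with the ticket number.
- Always run the top-level compile before finishing.
- When you make code changes, always add tests.
```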

reply
This seems like a tremendous amount of planning, babysitting, verification, and token cost just to avoid writing code and tests yourself.
reply
It's assigning yourself the literal worst parts of the job - writing specs, docs, tests and reading someone else's code.
reply
Yes, with the reward of: I don't understand this code and didn't learn anything incrementally about the feature I "planned".
reply
How do you define visibility rules? Is that possible for subagents?
reply
AFAIK Claude doesn't support it, but if you're willing to go the extra mile, you can get creative with some bash script: https://pastebin.com/raw/m9YQ8MyS (generated this a second ago - just to get the point across)

To be clear, I don't do this. I never saw an agent cheat by peeking or something. I really did look through their logs.

I'd be very interested to see claude code and other tools support this pattern when dispatching agents to be really sure.

reply
> To be clear, I don't do this.

How do you know that it works then? Are you using a different tool that does support it?

reply
So what do you do? Do you define roles somewhere and tell the agent to assign these roles to subagents?
reply
Fun to see you not on tildes.

Setting up a clean room is one of the only ways to do Evals on agentic harnesses. Especially prevalent with Windsurf which doesn’t have an easy CLI start.

So how? The easiest answer, when allowed, is Docker. Literally a new image per prompt. There are also flags with Claude to not use memory, and from there you can use -p to have it behave like a normal CLI tool. Windsurf requires the manual effort of starting it up in a new dir.
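
A sketch of what the Docker route can look like. The image name and prompt are placeholders, and the function just prints the command it would run; `--rm` throws the container away afterward, and `claude -p` is the non-interactive print mode:

```shell
# Build a one-shot eval command: fresh container per prompt, nothing persists.
one_shot() {
  image="$1"; prompt="$2"
  echo docker run --rm -v "$PWD:/work" -w /work "$image" claude -p "$prompt"
}

one_shot my-eval-image "Run the test suite and summarize failures"
```

Wrapping it in a function like this makes "new image per prompt" the default rather than something you have to remember.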

reply
Sounds interesting, but I'm not quite getting the relevance for people writing code with an agent. Should I be doing evals?
reply
> Reward hacking is very real and hard to guard against.

Is it really about rewards? I'm genuinely curious, because it's not an RL model.

reply
I'm noticing terms related to DL/RL/NLP are being used more and more informally as AI takes over more of the cultural zeitgeist and people want to use the fancy new terms of the era, even if inaccurately. A friend told me he "trained and fine tuned a custom agent" for his work when what he meant was he modified a claude.md file.
reply
There is a nontrivial amount of RL training (RLHF, RLVR, ...), so it would be reasonable to call it an RL model.

And with that comes reward hacking - which isn't really about looking for more reward but rather that the model has learned patterns of behavior that got reward in the train env.

That is, any kind of vulnerability in the train env manifests as something you'd recognize as reward hacking in the real world: making tests pass _no matter what_ (because the train env rewarded that behavior), being wildly sycophantic (because the human evaluators rewarded that behavior), etc.

reply
> There is a nontrivial amount of RL training (RLHF, RLVR, ...), so it would be reasonable to call it an RL model.

Hm, as I understand it, parts of the training of e.g. ChatGPT could be called RL. But the subject being trained/fine-tuned is still a seq2seq next-token-predictor transformer neural net.

reply
RL is simply a broad category of training methods. It's not really an architecture per se: modern GPTs are trained first on a reconstruction objective over massive text corpora (the 'large language' part), then on various RL objectives, +/- more post-training depending on the lab.
reply
> Is it really about rewards? Im genuinely curious. Because its not a RL model.

Ha, good point. I was using it informally (you could handwave and call it an intrinsic reward if a model is well aligned to completing tasks as requested), but I hadn't really thought about it.

Searching around, it seems like I'm not alone, but it looks like "specification gaming" is also sometimes used, like: https://deepmind.google/blog/specification-gaming-the-flip-s...

reply
They probably meant goal hacking. (I just made that up)
reply
A refactor should not affect the tests at all should it? If it does, it's more than a refactor.
reply
It can if your refactor involves interface changes, like moving methods around, changing argument order, etc. All of these changes need to propagate to the tests.
reply
Your tests are an assertion that 'no matter what, this will never change'. If your interface can change, then you are testing implementation details instead of the behavior users care about.

The above is really hard. A lot of TDD 'experts' don't understand this and teach fragile tests that are not worth having.

reply
https://www.hyrumslaw.com/

Your implementation is your interface. It's a bit naive, or hating-your-users, to assume your tests are what your users care about. They're dealing with everything, regardless of what you've tested or not.

reply
Refactoring is changing the design of the code without affecting the behaviour.

You can change an interface and not change the behaviour.

I have rarely heard an interpretation as rigid as this one.

reply
It depends on what you mean by "refactor" and how exactly you're testing, I guess, but that's not really the heart of the point. Red-green-refactor could also be used for adding new features, for instance, or applied to an entire codebase.
reply
I've been able to encode Outside-in Test Driven Development into a repeatable workflow. Claude Code follows it to a T, and I've gotten great results. I've written about it more here, and created a repo people can use out of the box to try it out:

https://www.joegaebel.com/articles/principled-agentic-softwa... https://github.com/JoeGaebel/outside-in-tdd-starter

reply
I'm telling it to use red/green TDD [1] and it will write tests that don't fail, then say "ah, the issue is already fixed" and move on. You really have to watch it very closely. I'm having a huge problem with bad tests in my system despite a "governance model" that I always refer it to, which requires red/green TDD.

[1] https://simonwillison.net/guides/agentic-engineering-pattern...

reply
This sounds interesting. Can you go a bit deeper or provide references on how to implement the green/red/refactor subagent pattern?
reply
It’s not an agentic pattern, it’s an approach to test driven development.

You write a failing test for the new functionality that you’re going to add (which doesn’t exist yet, so the test is red). You then write the code until the test passes (that is, goes green).
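
As a tiny end-to-end example of the loop (module and test names are invented; assumes Python 3 on PATH):

```shell
mkdir -p tdd-demo && cd tdd-demo

# RED: write the test first; it fails because slug.py doesn't exist yet
cat > test_slug.py <<'EOF'
import unittest
from slug import slugify   # module under test (name is an assumption)

class TestSlug(unittest.TestCase):
    def test_spaces_become_dashes(self):
        self.assertEqual(slugify("hello world"), "hello-world")

if __name__ == "__main__":
    unittest.main()
EOF
python3 test_slug.py 2>/dev/null && echo "unexpected pass" || echo "RED"

# GREEN: write just enough implementation to make it pass
cat > slug.py <<'EOF'
def slugify(text):
    return text.lower().replace(" ", "-")
EOF
python3 test_slug.py 2>/dev/null && echo "GREEN"
```

The point of seeing it fail first is that it proves the test is actually exercising something, rather than passing vacuously.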

reply
What has worked better for me is splitting authority, not just prompts. One agent can touch app code, one can only write failing tests plus a short bug hypothesis, and one only reviews the diff and test output. Also make test files read only for the coding agent. That cuts out a surprising amount of self-grading behavior.
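
The read-only part is just file permissions. A tiny sketch (paths invented):

```shell
# Carve out authority: the coding agent's session sees tests/ read-only.
mkdir -p authz-demo/tests && cd authz-demo
echo "assert True" > tests/test_example.py
chmod -R a-w tests    # the test-writing agent keeps its own writable copy

ls -l tests/test_example.py | cut -c2-10   # permission bits, now r--r--r--
```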
reply
How do you limit access like that?
reply
I built rlm-workflow which has stage gating, TDD and sub-agent support: https://skills.sh/doubleuuser/rlm-workflow/rlm-workflow
reply
That's the cool bit - you don't have to. CC is perfectly well aware of it and competent enough to implement it; just tell it to.
reply
"So this is how liberty dies... with thunderous applause." - Padmé Amidala

s/liberty/knowledge

reply
So more stuff happens with this approach but how do you know what it generates is correct?
reply
Good idea, and an improvement, but you still have that fundamental issue: you don't really know what code has been written. You don't know the refactors are right, in alignment with existing patterns etc.
reply
How exactly do you set up your CC sessions to do this?
reply
That's a great idea - I have been using Codex to do my code reviews, since I've found it gives better critique on code written by Claude, but I haven't tried it with testing yet!
reply
Codex/GPT is a stubborn model; I doubt it would accept Claude's reviews or push back on them. I have seen cases where Claude is more willing to comply when given shared feedback, though maybe that's just sycophancy too.
reply