undefined

upvote

points

by magicalist5 hours ago |

upvote

by egeozcan4 hours ago|

[-]

The trick is, with the setup I mentioned, you change the rewards.

The concept is:

Red Team (Test Writers), write tests without seeing implementation. They define what the code should do based on specs/requirements only. Rewarded by test failures. A new test that passes immediately is suspicious as it means either the implementation already covers it (diminishing returns) or the test is tautological. Red's ideal outcome is a well-named test that fails, because that represents a gap between spec and implementation that didn't previously have a tripwire. Their proxy metric is "number of meaningful new failures introduced" and the barrier prevents them from writing tests pre-adapted to pass.

Green Team (Implementers), write implementation to pass tests without seeing the test code directly. They only see test results (pass/fail) and the spec. Rewarded by turning red tests green. Straightforward, but the barrier makes the reward structure honest. Without it, Green could satisfy the reward trivially by reading assertions and hard-coding. With it, Green has to actually close the gap between spec intent and code behavior, using error messages as noisy gradient signal rather than exact targets. Their reward is "tests that were failing now pass," and the only reliable strategy to get there is faithful implementation.

Refactor Team, improve code quality without changing behavior. They can see implementation but are constrained by tests passing. Rewarded by nothing changing (pretty unusual in this regard). Reward is that all tests stay green while code quality metrics improve. They're optimizing a secondary objective (readability, simplicity, modularity, etc.) under a hard constraint (behavioral equivalence). The spec barrier ensures they can't redefine "improvement" to include feature work. If you have any code quality tools, it makes sense to give the necessary skills to use them to this team.

It's worth being honest about the limits. The spec itself is a shared artifact visible to both Red and Green, so if the spec is vague, both agents might converge on the same wrong interpretation, and the tests will pass for the wrong reason. The Coordinator (your main claude/codex/whatever instance) mitigates this by watching for suspiciously easy green passes (just tell it) and probing the spec for ambiguity, but it's not a complete defense.

reply

upvote

by w4yai2 hours ago|

[-]

You guys are describing wonderful things, but I've yet to see any implementation. I tried coding my own agents, yet the results were disappointing.

What kind of setup do you use ? Can you share ? How much does it cost ?

reply

upvote

by throwaway77838 minutes ago|

[-]

We have a very uncomplicated setup with claude code. A CLAUDE.md with instructions and notes about the repo and how to run stuff. We also do code reviews with Claude Code, but in a separate session.

It works wonderfully well. Costs about $200USD per developer per month as of now.

reply

upvote

by aprdm1 hours ago|

[-]

If you are not spending 5-10k dollars a month for interesting projects, you likely won't see interesting results

reply

upvote

by mrbungie29 minutes ago|

[-]

I can't really tell if this is sarcasm or not.

reply

upvote

by dworks2 hours ago|

[-]

rlm-workflow does all that TDD for you: https://skills.sh/doubleuuser/rlm-workflow/rlm-workflow

(I built it)

reply

upvote

by cheema3328 minutes ago|

[-]

Why make powershell a requirement? I like powershell, but Python is very common and already installed on many dev systems.

reply

upvote

by _ink_2 hours ago|

[-]

Thanks for sharing. What does RLM stand for? Any idea why the socket security test fails?

reply

upvote

by stavros2 hours ago|

[-]

Recursive language models: https://github.com/doubleuuser/rlm-workflow

reply

upvote

by canadiantim1 hours ago|

[-]

Check out Mike Pocock’s work, he’s done excellent work writing about red green refactor and has a GitHub repo for his skills. Read and take what you need from his tdd skill and incorporate it into your own tdd skill tailored for your project.

reply

upvote

by nojito44 minutes ago|

[-]

This is just ai slop. If you follow what the actual designers of Claude/GPT tell you it flys in the face of building out over engineered harnesses for agents.

reply

upvote

by throwaway77837 minutes ago|

[-]

I agree with this. There is not a lot of harnesses/wrapping needed for Claude Code.

reply

upvote

by canadiantim45 seconds ago|

[-]

You don't need a harness beyond Claude Code, but honestly it's foolish to think you shouldn't be building out extra skills to help your workflow. A TDD skill that does red-green-refactoring is using Claude Code exactly as how it's meant to be used. They pioneered skills.

reply

upvote

by canadiantim8 minutes ago|

[-]

Works better than standard claude / gpt, which doesn't do red-green-refactor. Doesn't seem like slop when it meaningfully changes the results for the better, consistently. Really is a game-changer. You should consider trying it.

reply

upvote

by tomtom13374 hours ago|

[-]

This is very interesting, but like sibling comments, I'm very curious as to how you run this in practice. Do you just tell Claude/Copilot to do what you describe?

And do you have any prompts to share?

reply

upvote

by throwaway77833 minutes ago|

[-]

You don't need most of this. Prompts are also normally what you would say to another engineer.

* There is a lot of duplication between A & B. Refactor this.

* Look at ticket X and give me a root cause

* Add support for three new types of credentials - Basic Auth, Bearer Token and OAuth Client Creds

Claude.md has stuff like "Here's how you run the frontend. here's how u run backend. This module support frontend. That module is batch jobs. Always start commit messages with ticket number. Always run compile at the top level. When you make code changes, always add tests" etc etc

reply

upvote

by xienze2 hours ago|

[-]

This seems like a tremendous amount of planning, babysitting, verification, and token cost just to avoid writing code and tests yourself.

reply

upvote

by habinero2 hours ago|

[-]

It's assigning yourself the literal worst parts of the job - writing specs, docs, tests and reading someone else's code.

reply

upvote

by gedy1 hours ago|

[-]

Yes with the reward of: I don't understand this code and didn't learn anything incrementally about the feature I "planned".

reply

upvote

by skybrian4 hours ago|

[-]

How do you define visibility rules? Is that possible for subagents?

reply

upvote

by egeozcan4 hours ago|

[-]

AFAIK Claude doesn't support it, but if you're willing to go the extra mile, you can get creative with some bash script: https://pastebin.com/raw/m9YQ8MyS (generated this a second ago - just to get the point across )

To be clear, I don't do this. I never saw an agent cheat by peeking or something. I really did look through their logs.

I'd be very interested to see claude code and other tools support this pattern when dispatching agents to be really sure.

reply

upvote

by achierius3 hours ago|

[-]

> To be clear, I don't do this.

How do you know that it works then? Are you using a different tool that does support it?

reply

upvote

by skybrian4 hours ago|

[-]

So what do you do? Do you define roles somewhere and tell the agent to assign these roles to subagents?

reply

upvote

by ssk422 hours ago|

[-]

Fun to see you not on tildes.

Setting up a clean room is one of the only ways to do Evals on agentic harnesses. Especially prevalent with Windsurf which doesn’t have an easy CLI start.

So how? The easiest answer when allowed is docker. Literally new image per prompt. There’s also flags with Claude to not use memory and from there you can use -p to have it just be like a normal cli tool. Windsurf requires manual effort of starting it up in a new dir.

reply

upvote

by skybrian25 minutes ago|

[-]

Sounds interesting, but I'm not quite getting the relevance for people writing code with an agent. Should I be doing evals?

reply

upvote

by lagrange775 hours ago|

[-]

> Reward hacking is very real and hard to guard against.

Is it really about rewards? Im genuinely curious. Because its not a RL model.

reply

upvote

by gbnwl4 hours ago|

[-]

I'm noticing terms related to DL/RL/NLP are being used more and more informally as AI takes over more of the cultural zeitgeist and people want to use the fancy new terms of the era, even if inaccurately. A friend told me he "trained and fine tuned a custom agent" for his work when what he meant was he modified a claude.md file.

reply

upvote

by hexaga3 hours ago|

[-]

There is a nontrivial amount of RL training (RLHF, RLVR, ...), so it would be reasonable to call it an RL model.

And with that comes reward hacking - which isn't really about looking for more reward but rather that the model has learned patterns of behavior that got reward in the train env.

That is, any kind of vulnerability in the train env manifests as something you'd recognize as reward hacking in the real world: making tests pass _no matter what_ (because the train env rewarded that behavior), being wildly sycophantic (because the human evaluators rewarded that behavior), etc.

reply

upvote

by lagrange7753 minutes ago|

[-]

> There is a nontrivial amount of RL training (RLHF, RLVR, ...), so it would be reasonable to call it an RL model.

Hm, as i understand it, parts of the training of e.g. ChatGPT could be called RL models. But the subject to be trained/fine tuned is still a seq2seq next token predictor transformer neural net.

reply

upvote

by hexaga11 minutes ago|

[-]

RL is simply a broad category of training methods. It's not really an architecture per se: modern GPTs are trained first on reconstruction objective on massive text corpora (the 'large language' part), then on various RL objectives +/- more post-training depending on which lab.

reply

upvote

by magicalist4 hours ago|

[-]

> Is it really about rewards? Im genuinely curious. Because its not a RL model.

Ha, good point. I was using it informally (you could handwave and call it an intrinsic reward if a model is well aligned to completing tasks as requested), but I hadn't really thought about it.

Searching around, it seems like I'm not alone, but it looks like "specification gaming" is also sometimes used, like: https://deepmind.google/blog/specification-gaming-the-flip-s...

reply

upvote

by nurettin4 hours ago|

[-]

They probably meant goal hacking. (I just made that up)

reply

upvote

by SoftTalker4 hours ago|

[-]

A refactor should not affect the tests at all should it? If it does, it's more than a refactor.

reply

upvote

by gchamonlive4 hours ago|

[-]

It can if your refactor needs to deal with interface changes, like moving methods around, changing argument order etc... all these need to propagate to the tests

reply

upvote

by bluGill3 hours ago|

[-]

Your tests are an assertion that 'no matter what this will never change'. If your interface can change then you are testing implementation details instead of the behavior users care about.

the above is really hard. A lot of tdd 'experts' don't understand is and teach fragile tests that are not worth having.

reply

upvote

by 8note48 minutes ago|

[-]

https://www.hyrumslaw.com/

your implementation is your interface. its a bit naive or hating-your-users to assume your tests are what your users care about. theyre dealing with everything, regardless of what youve tested or not.

reply

upvote

by switchbak1 hours ago|

[-]

Refactoring is changing the design of the code without affecting the behaviour.

You can change an interface and not change the behaviour.

I have rarely heard such a rigid interpretation such as this.

reply

upvote

by magicalist4 hours ago|

[-]

It depends on what you mean by "refactor" and how exactly you're testing, I guess, but that's not really at the heart of the point. red-green-refactor could also be used for adding new features, for instance, or an entire codebase, I guess.

reply