What kind of setup do you use ? Can you share ? How much does it cost ?
It works wonderfully well. Costs about $200USD per developer per month as of now.
(I built it)
And do you have any prompts to share?
* There is a lot of duplication between A & B. Refactor this.
* Look at ticket X and give me a root cause
* Add support for three new types of credentials - Basic Auth, Bearer Token and OAuth Client Creds
Claude.md has stuff like "Here's how you run the frontend. here's how u run backend. This module support frontend. That module is batch jobs. Always start commit messages with ticket number. Always run compile at the top level. When you make code changes, always add tests" etc etc
To be clear, I don't do this. I never saw an agent cheat by peeking or something. I really did look through their logs.
I'd be very interested to see claude code and other tools support this pattern when dispatching agents to be really sure.
How do you know that it works then? Are you using a different tool that does support it?
Setting up a clean room is one of the only ways to do Evals on agentic harnesses. Especially prevalent with Windsurf which doesn’t have an easy CLI start.
So how? The easiest answer when allowed is docker. Literally new image per prompt. There’s also flags with Claude to not use memory and from there you can use -p to have it just be like a normal cli tool. Windsurf requires manual effort of starting it up in a new dir.