Usually things go smoothly but sometimes I have situations like: “please add feature X, needs to have ABCD.” -> does ABC correct but D wrong -> “here is how to fix D” -> fixes D but breaks AB -> “remember I also want AB this way, you broke it” -> fixes AB but removes C and so on
What's been working for me is keeping a CLAUDE.md file in my project root with key decisions and context. The model reads it at the start of every session so I don't have to re-explain everything. Not as elegant as automated compaction but it works.
I generate task.md files before working on anything, some are short, others are super long and with many steps. The models don't deviate anymore. One trick is to make a post tool use hook to show the first open gate "- [ ]" line from task.md on each tool call. This keeps the agent straight for 100s of gates.
After each gate is executed we don't just check it, we also append a few words of feedback. This makes the task.md become a workbook, covering intent, plan, execution and even judgements. I see it like a programming language now. I can gate any task and the agent will do it, however many steps. It can even generate new gates, or replan itself midway.
You can enforce strict testing policies by just leaning into gate programability power - after each work gate have a test gate, and have judges review testing quality and propose more tests.
The task.md file is like a script or pipeline. It is also like a first class function, it can even ingest other task.md files for regular reflexion. A gate can create or modify gates, or tasks. A task can create or modify gates or tasks.