It's worse with renaming things in code. I've yet to see an agent use refactoring tools (if they even exist in VS Code) instead of brute-forcing renames with string replacement or sed. Agents go edit -> build -> read errors -> repeat instead of using a reliable tool, and it burns a lot more GPU...
When using codex, I usually have something like "Never add 3rd party libraries unless explicitly requested. When adding new libraries, use `cargo add $crate` without specifying the version, so we get the latest version." in my instructions, and it seems to prevent this issue entirely.
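For what it's worth, the point of omitting the version is that `cargo add` resolves it against the registry at the moment you run it, whereas a version the agent types out itself usually comes from its training data. A minimal sketch of the resulting Cargo.toml entry (crate name and version are just examples):

```toml
[dependencies]
# `cargo add serde` fills in whatever the latest published version is today
serde = "1.0.219"
```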
Though that is, at least to me, a bit of an anti-pattern for exactly that reason. I've found it far more successful to blow away the context and restart with a new prompt distilled from the old context, instead of having a very long-running back-and-forth.
It's better than it was with the latest models; I can have them stick around longer, but it's still a useful pattern even with 4.6/5.3.
That's their strategy for everything the training data can't solve. This is the main reason the autonomous agent swarm approach doesn't work for me: 20 bucks in tokens obliterated by 5 agents exchanging hallucinations with each other. It's way too easy for them to amplify each other's mistakes without a human to intervene.
On the second point, I totally agree. I keep hoping that agents will get better at refactoring, and I think using LSPs effectively is what would make that happen. Claude took dozens of minutes to perform a rename that JetBrains would have executed perfectly in about five seconds. Its approach was: make a change, run the tests, do it again. Nuts.
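For what it's worth, the LSP already exposes exactly this: a single `textDocument/rename` request returns a WorkspaceEdit covering every file that references the symbol, and the editor applies it atomically. A minimal sketch of what that request looks like on the wire (the file path, position, and new name are made up):

```python
import json

# Hypothetical rename: the cursor sits on the old identifier in src/lib.rs and
# we ask the language server for a project-wide rename to "parse_config".
rename_request = {
    "jsonrpc": "2.0",
    "id": 42,
    "method": "textDocument/rename",
    "params": {
        "textDocument": {"uri": "file:///project/src/lib.rs"},
        "position": {"line": 10, "character": 4},  # zero-based, per the spec
        "newName": "parse_config",
    },
}

# LSP messages are framed with a Content-Length header over stdio.
body = json.dumps(rename_request)
print(f"Content-Length: {len(body.encode('utf-8'))}\r\n\r\n{body}")
```

The server answers with the full set of edits across the workspace, so there's no edit -> build -> read errors loop at all.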
Think about what a developer would do:

- check the latest version online;
- look at the changelog;
- evaluate whether the upgrade is worth it, or whether an intermediate version is enough if code changes are needed.
Of course, you can keep these operations in human hands, but if you really want to automate this part (and are ready to pay the consequences) you need to mimic the same workflow. I use Gemini and codex to look up package version information online and check the changelogs from the version I'm on to the one I'd like to upgrade to; I spawn a Claude Opus subagent to check whether anything in the code needs to change. For major releases, I git clone the two package versions and another subagent checks whether the interfaces I use have changed. Finally, I run all my tests and verify everything's alright.
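At least the "check the latest version online" step is trivially scriptable rather than something to delegate to a model; a sketch against the public crates.io API (assuming Rust crates as upthread; the response field name is from memory, so verify it before relying on this):

```python
import json
import urllib.request

def latest_version(crate: str) -> str:
    # crates.io asks API clients to identify themselves with a User-Agent.
    url = f"https://crates.io/api/v1/crates/{crate}"
    req = urllib.request.Request(url, headers={"User-Agent": "upgrade-checker"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["crate"]["max_version"]  # assumed field name in the v1 response

current = "1.0.150"  # whatever Cargo.lock says today (example value)
latest = latest_version("serde")
if latest != current:
    print(f"serde: {current} -> {latest}, read the changelog before bumping")
```

The changelog reading and interface diffing are where the subagents actually earn their keep.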
Yes, it still might not be perfect, but neither am I.
The AI hasn't understood what's going on; instead it has pattern-matched strings and used those patterns to create new strings that /look/ right but fail upon inspection.
(The human involved is also failing my Turing test...)