I have a hard time understanding how that would work. I typically interface with coding agents through Cursor. The flow is like this: ask it something -> it works for a minute or two -> I verify and fix by asking it again, and so on until we're at a happy place with the code. How do you stop it from going down a bad path and never pulling itself out?

The important role for me, as a SWE, in the process is to verify that the code does what we actually want it to do. If you remove yourself from the process by letting it run on its own overnight, how does it know it's doing what you actually want it to do?

Or is it more that, with your use case, you can say "here's a failing test, do whatever you can to fix it and don't stop until you do"? I could see that limited case working.

by woah, 4 hours ago

For some reason, setting up agents in a loop with a solid prompt and new context each iteration seems to result in higher quality work on larger or more difficult tasks than the chat interface does. It's like the agent doesn't have to spend half its time trying to guess what you want.

by gamegoblin, 1 hour ago

I use Codex CLI or Claude Code

I don't even necessarily ask it to fix the bug — just identify the bug

Like if I've made a change that is causing some unit test to fail, it can just run off and figure out where I made an off-by-one error or whatever in my change.

by zem, 2 hours ago

it's more like "this function is crashing with an inconsistent file format error. can you figure out how a file with the wrong format got this far into the pipeline?". in cases like that the fix is usually pretty easy once you have the one code path out of several thousand nailed down.

by p1esk, 4 hours ago

> "here's a failing test, do whatever you can to fix it"

Bad idea. It can modify the code so that the test passes but everything else is now broken.
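
One way to guard against that failure mode is to accept a fix only when the target test passes and nothing that used to pass has regressed. A minimal sketch, assuming you can collect per-test results as a name-to-passed mapping before and after the agent's change (the function name and the result format are hypothetical, not part of any particular tool):

```python
from typing import Dict


def is_acceptable_fix(before: Dict[str, bool], after: Dict[str, bool], target: str) -> bool:
    """Return True only if `target` now passes and no previously passing test broke.

    `before` and `after` map test names to whether the test passed
    (a hypothetical format; adapt it to whatever your test runner reports).
    """
    # The test we asked the agent to fix must actually pass now.
    if not after.get(target, False):
        return False
    # Any test that passed before and fails (or disappeared) afterwards is a regression.
    regressions = [
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    ]
    return not regressions
```

Treating a deleted test as a regression is a deliberate choice here, since removing the failing test is one of the ways a "fix" can go wrong.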

by SatvikBeri, 1 hour ago

I've heard this said a lot but never had this problem. Claude has been decent at debugging tests since 4.0 in my experience (and much better since 4.5)

by vel0city, 3 hours ago

You do things like ralph loops.

https://github.com/snarktank/ralph

It's constantly restarting itself, looking at the current state of things, re-reading what the request was and what it tried and failed at in the past (at a higher level), and trying again and again.
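
The general shape of such a loop can be sketched as follows. This is only an illustration of the idea described above, not the code from the linked repo: `run_agent` and `is_done` are hypothetical callables you would supply (for example, one that invokes your agent CLI with a fresh session and returns a short summary, and one that runs your test suite).

```python
from typing import Callable, List


def fresh_context_loop(
    prompt: str,
    run_agent: Callable[[str], str],
    is_done: Callable[[], bool],
    max_iterations: int = 10,
) -> List[str]:
    """Restart the agent with a fresh context until the task is done.

    Each iteration gets the original request plus high-level notes on
    earlier attempts, rather than a long accumulated chat history.
    """
    notes: List[str] = []
    for _ in range(max_iterations):
        # Check the current state of things before doing any more work.
        if is_done():
            break
        # Rebuild the context from scratch: the request, then past attempts.
        context = prompt
        if notes:
            context += "\n\nPrevious attempts:\n" + "\n".join(f"- {n}" for n in notes)
        # run_agent returns a short summary of what this attempt did.
        notes.append(run_agent(context))
    return notes
```

The `max_iterations` cap is an assumption added here so the loop cannot run forever if the agent never reaches a passing state.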

by tsss, 5 hours ago

How can you afford that?

by wahnfrieden, 5 hours ago

It costs $200 for a month

by addaon, 4 hours ago

> it's an ideal usecase

This is impressive, you’ve completely mitigated the risk of learning or understanding.

by arcanemachiner, 4 hours ago

Or they have freed up time for more useful endeavours, time that might otherwise have been spent on drudgery.

I don't discount the value of blood, sweat and tears spent on debugging those hard issues, and the lessons learned from doing so, but there is a certain point where it's OK to take a pass and just let the robots figure it out.