How hard have you tried?

I've been finding that the Opus 4.5/4.6 and GPT-5.2/5.3 models really have represented a step-change in how good they are at running long tasks.

I can one-shot prompt all sorts of useful coding challenges now that previously I would have expected to need multiple follow-ups to fix mistakes the agents made.

I got all of this from a single prompt, for example: https://github.com/simonw/research/tree/main/cysqlite-wasm-w... - including this demo page: https://simonw.github.io/research/cysqlite-wasm-wheel/demo.h... - using this single prompt: https://github.com/simonw/research/pull/79

by aeyes 5 hours ago

What do you mean? The generated script just downloads the sources and runs pyodide: https://github.com/simonw/research/blob/main/cysqlite-wasm-w...

There are maybe 5 relevant lines in the script, and nothing complex at all that would require it to run for days.

by andai 4 hours ago

Maybe so, but I did once spend 12 hours straight debugging an Emscripten C++ compiler bug! (After spending the first day of the jam setting up Emscripten, and the second day getting Raylib to compile in it. Had like an hour left to make the actual game, hahah.)

I am a bit thick with such things, but just wanted to provide the context that Emscripten can be a fickle beast :)

I sure am glad I can now deploy Infinite Mechanized Autistic Persistence to such soul-crushing tasks, and go make a sandwich or something.

(The bug turned out to be that if I included a boolean in a class member, the whole game crashed, but only the Emscripten version. Sad. Ended up switching back to JS, which you basically need anyway for most serious web game dev.)

by simonw 4 hours ago

No, not for days - but it churned away on that one for about ten minutes.

I don't think I've got any examples of multi-hour or multi-day sessions that ran completely uninterrupted - this one back in December took 4.5 hours but I had to prompt it to keep going a few times along the way: https://simonwillison.net/2025/Dec/15/porting-justhtml/

by basilgohar 5 hours ago

Can you share any examples of these one-shot prompts? I've not gotten to the point where I can get those kinds of results yet.

by simonw 4 hours ago

If you look through the commit logs on simonw/research and simonw/tools on GitHub, most commits should either list the prompt, link to a PR with the prompt, or link to a session transcript.

by gamegoblin 5 hours ago

I routinely leave Codex running for a few hours overnight to debug stuff.

If you have a deterministic unit test that can reproduce the bug through your app's front door, but you have no idea how the bug is actually happening, then having a coding agent just grind through the slog of sticking debug prints everywhere, testing hypotheses, etc. is great: it's an ideal use case.

by nikkwong 4 hours ago

I have a hard time understanding how that would work. I typically interface with coding agents through Cursor, and the flow is like this: ask it something -> it works for a minute or two -> I have to verify and fix by asking it again, and so on until we're at a happy place with the code. How do you stop it from going down a bad path and never pulling itself out?

The important role for me as a SWE in this process is to verify that the code does what we actually want it to do. If you remove yourself from the process by letting it run on its own overnight, how does it know it's doing what you actually want it to do?

Or is it more like your use case, where you can say "here's a failing test, do whatever you can to fix it and don't stop until you do"? I could see that limited case working.

by woah 4 hours ago

For some reason, setting up agents in a loop with a solid prompt and a new context each iteration seems to result in higher-quality work on larger or more difficult tasks than the chat interface does. It's like the agent doesn't have to spend half its time trying to guess what you want.

by gamegoblin 1 hour ago

I use Codex CLI or Claude Code

I don't even necessarily ask it to fix the bug — just identify the bug

Like if I've made a change that is causing some unit test to fail, it can just run off and figure out where I made an off-by-one error or whatever in my change.

by zem 2 hours ago

it's more like "this function is crashing with an inconsistent file format error. can you figure out how a file with the wrong format got this far into the pipeline?". in cases like that the fix is usually pretty easy once you have the one code path out of several thousands nailed down.

by p1esk 4 hours ago

“here's a failing test—do whatever you can to fix it”

Bad idea. It can modify the code so that the test passes but everything else is now broken.

by SatvikBeri 1 hour ago

I've heard this said a lot but never had this problem. Claude has been decent at debugging tests since 4.0 in my experience (and much better since 4.5)

by vel0city 3 hours ago

You do things like ralph loops.

https://github.com/snarktank/ralph

It's constantly restarting itself, looking at the current state of things, re-reading what the request was and what it did and failed at in the past (at a higher level), and trying again and again.
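Very roughly, the shape of such a loop looks something like the Python below. This is only a sketch of the idea described above, not how the linked repo actually works: `run_agent`, `is_done`, and the notes file are all hypothetical stand-ins (a fresh agent session, an acceptance check such as a test run, and a plain-text log of earlier attempts).

```python
from pathlib import Path
from typing import Callable


def ralph_loop(
    request: str,
    run_agent: Callable[[str], str],  # hypothetical: one fresh agent session, returns a short summary
    is_done: Callable[[], bool],      # hypothetical: acceptance check, e.g. "do the tests pass?"
    notes_path: Path,                 # hypothetical: where summaries of past attempts are kept
    max_iterations: int = 20,
) -> int:
    """Restart the agent with fresh context until the check passes.

    Returns the number of iterations used, or -1 if the budget ran out.
    """
    for iteration in range(1, max_iterations + 1):
        # Each iteration starts from scratch: the original request plus
        # a high-level record of what was already tried.
        notes = notes_path.read_text() if notes_path.exists() else ""
        prompt = (
            f"Request:\n{request}\n\n"
            f"Notes from earlier attempts:\n{notes or '(none)'}\n"
        )
        summary = run_agent(prompt)

        # Record what happened so the next fresh session can see it.
        with notes_path.open("a") as f:
            f.write(f"Attempt {iteration}: {summary}\n")

        if is_done():
            return iteration
    return -1
```

The point of the notes file is that the agent's context is thrown away every iteration, so anything worth remembering has to live outside it.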

by tsss 5 hours ago

How can you afford that?

by wahnfrieden 5 hours ago

It costs $200 for a month

by addaon 4 hours ago

> it's an ideal usecase

This is impressive, you’ve completely mitigated the risk of learning or understanding.

by arcanemachiner 4 hours ago

Or, they have freed up time for more useful endeavours, time that might otherwise have been spent on drudgery.

I don't discount the value of blood, sweat and tears spent on debugging those hard issues, and the lessons learned from doing so, but there is a certain point where it's OK to take a pass and just let the robots figure it out.

by XCSme 5 hours ago

Their ability to burn through tokens non-stop for hours, days or weeks without intervention.

by raw_anon_1111 4 hours ago

You’re mixing up OpenAI with Anthropic.

Anthropic is actually sort of concerned with not burning through cash and charging people a reasonable price. OpenAI doesn’t care. I can use Codex CLI all day and not approach any quotas with just my $20 a month ChatGPT subscription.

I treat coding agents like junior developers and never take my hand off the wheel except for boilerplate refactoring.

by TheMuenster 2 hours ago

Can I just say how funny this metric is?

"Our model is so slow and our tokens/second is so low that these tasks can take hours!" is not the advertising they think it is.

by johnfn 4 hours ago

The other day I got Codex to one-shot an upgrade to Vite 8 at my day job (a real website with revenue). It worked on this for over 3 hours without intervention (I went to sleep). This is now in production.

by seunosewa 3 hours ago

How did you verify it?

by girvo 2 hours ago

Just send it bro

(but honestly for a lot of websites and web apps you really can just send it, the stakes are very low for a lot of what most people do, if they're honest with themselves)

by wahnfrieden 5 hours ago

It worked for me several times.

It's easy to say that these increasingly popular tools are only able to produce useless junk. You haven't tried, or you haven't "closed the loop" so that the agent can evaluate its own progress toward acceptance criteria, or you are following the feeds of incompetent users.

by nikkwong 4 hours ago

I'm definitely bullish on LLMs for coding. It sounds to me as though getting one to run on its own for hours and produce something usable requires more careful thought and setup than just throwing a prompt at it and wishing for the best, but I haven't seen many examples in the wild yet.

by foobar10000 3 hours ago

It needs a closed loop.

Strategy -> [ Plan -> [Execute -> FastVerify -> SlowVerify] -> Benchmark -> Learn lessons] -> back to strategy for next big step.

Claude teams and a Ralph Wiggum loop can do it, or really any reasonable agent. But usually it all falls apart on brittle Verify or Benchmark steps. What is important is to record positive lessons in a store that survives git resets, machine blowups, etc. Any Telegram bot channel will do :)

The entire setup is usually a pain: Docker for verification, Docker for benchmarking, etc. You need the ability to run the thing quickly, the ability for the loop itself to add things, and the ability to do this in worktrees simultaneously for faster exploration. And God help you if you need hardware to do this. For example, such a loop is used to tune and custom-fuse CUDA kernels, which means a model evaluator, a big box, etc.
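To make the Strategy -> Plan -> Execute/Verify -> Benchmark -> Learn shape a bit more concrete, here is a minimal Python sketch of one pass through it. Everything in it is an assumption for illustration: the `Steps` callables stand in for agent calls, test runs, and benchmark harnesses, and the lessons list is kept in memory here, whereas in practice it would need to live somewhere that survives resets.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Steps:
    # All of these are hypothetical stand-ins for agent calls, test runs, and benchmarks.
    plan: Callable[[str, List[str]], List[str]]  # (strategy, lessons so far) -> tasks
    execute: Callable[[str], None]               # carry out one task
    fast_verify: Callable[[], bool]              # cheap check, e.g. unit tests
    slow_verify: Callable[[], bool]              # expensive check, e.g. integration tests
    benchmark: Callable[[], float]               # higher is better


@dataclass
class LoopState:
    lessons: List[str] = field(default_factory=list)
    best_score: float = float("-inf")


def run_step(strategy: str, steps: Steps, state: LoopState) -> LoopState:
    """One pass: plan, execute and verify each task, benchmark, record a lesson."""
    failures = []
    for task in steps.plan(strategy, state.lessons):
        steps.execute(task)
        # Cheap checks first; the slow check only runs if the fast one passes.
        if not (steps.fast_verify() and steps.slow_verify()):
            failures.append(task)

    score = steps.benchmark()
    # The lesson is what the next strategy/plan round gets to read.
    state.lessons.append(f"{strategy}: score={score}, failed={failures}")
    state.best_score = max(state.best_score, score)
    return state
```

The outer "back to strategy" part would just call `run_step` again with a new strategy and the same `state`, so each round plans with the accumulated lessons in hand.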

by wahnfrieden 3 hours ago

I do it easily just by asking Codex

by rcarmo 3 hours ago

well, you can start with https://github.com/rcarmo/go-textile, https://github.com/rcarmo/go-rdp, https://github.com/rcarmo/go-ooxml, https://github.com/rcarmo/go-busybox (still WIP). All of these are essentially SPEC and test-driven and they are all working for me (save a couple of bugs in go-rdp I need to fix myself, and some gaps in the ECMA specs for go-ooxml that require me to provide actual manually created documents for further testing).

I am currently porting pyte to Go through a similar approach (feeding the LLM with a core SPEC and two VT100/VT220 test suites). It's chugging along quite nicely.

by bitwize 4 hours ago

PEBKAC