But if you go beyond what can be tested easily, asking the agent to do real work rather than write a patch, imagining things to be true is a problem.
Coding can be treated as a low-stakes, closed-loop system (a retry only costs some time and money) in a way that most other tasks cannot.
If it screws up booking your flight or hotel room, how does the agent verify this? And even if it does verify, there is an actual cost to changes and cancellations.
Similarly with agentic e-commerce: there are lots of ways to screw that up, and it just seems ripe for fraud / being picked off by bad actors.
Unfortunately, travel keeps getting less flexible, with worse cancellation policies.
I can STILL replicate this behavior in Google AI summaries 10% of the time:
"is <SOMEPLANT> ok for cats"
to which it replies: "Yes, <SOMEPLANT LONG SCIENTIFIC NAME VERBOSE PHRASING> is toxic for cats"
The other one going around this weekend: "how long hot dogs on grill"
Summary: "The hot dogs on your grill are likely around 5-6 inches long .. "
So scale this category of error to unsupervised agents with access to your credit card.
Only with an LLM that's actually at agent-quality.
If "useful chatbot" and "useful agent" are two rungs on a ladder, the rung before them is "useful autocomplete". Autocomplete that only gets the next token right 90% of the time won't give you compiling code.
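To put a rough number on that: if you assume each token is right independently with probability p (a big simplification, real models don't behave this way), the chance that an n-token completion is right all the way through is p^n. A back-of-the-envelope sketch; the function name and the token counts are made up for illustration:

```python
def all_tokens_correct(p: float, n: int) -> float:
    """Chance that n tokens in a row are all correct, assuming each token
    is independently correct with probability p (an idealization)."""
    return p ** n


# Hypothetical completion lengths, just to show how fast the odds fall off.
for n in (10, 50, 200):
    print(f"{n:>4} tokens at 90% per token: {all_tokens_correct(0.9, n):.2e}")
```

Under that assumption, 90% per token is about a 35% chance of getting even 10 tokens in a row right, and about half a percent for 50. That is the sense in which 90% autocomplete doesn't get you compiling code.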