I see this as the gap between a general-purpose agent and a coding agent. A coding agent can imagine something to be true, test it, discover that it's wrong, and recover.

But if you go beyond what can be tested easily, asking the agent to do real work rather than writing a patch, imagining things to be true is a problem.

reply
This to me is the big leap from being good at coding to being good at many other tasks.

Coding can be treated as a low-stakes, closed-loop system (a retry only costs time and money), whereas most other tasks cannot.

If it screws up booking your flight/hotel room, how does the agent verify this? And even if it does verify it, there is an actual cost to changes/cancellations.

It's similar with agentic e-commerce: there are lots of ways to screw that up, and it just seems ripe for fraud / being picked off by bad actors.

reply
Seems like to make agents safe we need tentative, reversible transactions. How do you set up a travel plan and then review it? How do you modify it later?

Unfortunately, travel keeps getting less flexible, with worse cancelation policies.
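
To make "tentative, reversible" a bit more concrete, here is a rough sketch of what such a plan could look like. Everything in it is hypothetical (the `TentativePlan` class, its states, the `approved_by_human` flag), and it assumes providers would actually let you hold items without charging, which is the part travel mostly doesn't offer:

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    HELD = "held"            # nothing charged yet, still editable
    CONFIRMED = "confirmed"  # charged; changes now cost real money
    RELEASED = "released"    # holds dropped, nothing charged


@dataclass
class TentativePlan:
    """Hypothetical plan: the agent only places holds, a human confirms."""
    items: list = field(default_factory=list)
    state: State = State.HELD

    def add(self, description: str, price: float) -> None:
        # The agent may only modify a plan that has not been finalized.
        if self.state is not State.HELD:
            raise RuntimeError("plan is no longer editable")
        self.items.append((description, price))

    def total(self) -> float:
        return sum(price for _, price in self.items)

    def confirm(self, approved_by_human: bool) -> None:
        # The irreversible step is gated on explicit human review.
        if self.state is not State.HELD:
            raise RuntimeError("only a held plan can be confirmed")
        if not approved_by_human:
            raise PermissionError("human review required before charging")
        self.state = State.CONFIRMED

    def release(self) -> None:
        # Free to undo as long as nothing has been charged.
        if self.state is State.CONFIRMED:
            raise RuntimeError("already confirmed; cancellation fees may apply")
        self.state = State.RELEASED
```

The point of the design is that the agent can call `add` and `release` as often as it likes, while the only step with real-world cost, `confirm`, cannot happen without a person signing off.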

reply
To reply to myself here..

I can STILL replicate this behavior in Google AI summaries 10% of the time:

"is <SOMEPLANT> ok for cats"

to which it replies: "Yes, <SOMEPLANT LONG SCIENTIFIC NAME VERBOSE PHRASING> is toxic for cats"

The other one going around this weekend: "how long hot dogs on grill"

Summary: "The hot dogs on your grill are likely around 5-6 inches long .. "

So scale this category of error to unsupervised agents with access to your credit card.

reply
The problem is that with text/code, judgement is hard. Here is what it looks like for physical activity: https://www.youtube.com/shorts/lK7TjujKQLw At best it's not useful, and it could be a disaster for any unsupervised use.
reply
The gulf is bridgeable. The problem is that a lot of people are building agents without strong enough judgment layers around them. Work that can be verified with reasonable accuracy is the sweet spot right now.
reply
> The gulf is bridgeable.

Only with an LLM that's actually at agent-quality.

If "useful chatbot" and "useful agent" are two rungs on a ladder, the rung before them is "useful autocomplete". Autocomplete that only gets the next token right 90% of the time won't give you compiling code.
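
A back-of-the-envelope version of the 90% point, as a sketch only. It assumes every token has to be right and that token errors are independent, neither of which is strictly true of real models, and `p_all_correct` is just an illustrative name:

```python
def p_all_correct(per_token_accuracy: float, n_tokens: int) -> float:
    """Chance that all n_tokens are right, assuming independent errors."""
    return per_token_accuracy ** n_tokens


# Print the odds of an error-free completion at a few lengths.
for n in (10, 50, 100):
    print(f"{n} tokens: {p_all_correct(0.9, n):.6f}")
```

Under those assumptions a 10-token completion comes out fully correct only about 35% of the time, and by 50 tokens it is well under 1%, which is why per-step accuracy that sounds fine collapses over a long sequence.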

reply
How many of these layers are just trying to rediscover/rebuild the idempotence of code?
reply