undefined

points

[-]

I tell people to treat LLM's like a toddler (albeit a very capable toddler).

Do kids learn well when you only tell them what NOT to do? Of course not! You should be explaining how to do things correctly, and most importantly the WHY, as well as providing examples of both the "correct" and "incorrect" ways (also explaining why an example is incorrect).

by bostik6 hours ago|

parent|

[-]

The best way to describe AI agents I've heard: treat them as hostages that will do anything to appease their captor.

They have a vast latent knowledge base, infinite patience and zero capacity for making personal judgement calls. You give one a goal and it will try to meet that goal.

by generic920346 hours ago|

parent|

[-]

> The best way to describe AI agents I've heard: treat them as hostages that will do anything to appease their captor.

A scary image, if we consider agents to develop anything like a conscience at some point in time. Of course, with the current approach they never might, but are we so sure?

by palmotea7 hours ago|

parent|

prev|

[-]

> I tell people to treat LLM's like a toddler (albeit a very capable toddler).

Bbbbut a guy from Anthropic, just this last Friday, told me to think of Claude as my "brilliant coworker"! Are you telling me that's not true!?

by boc8 hours ago|

prev|

[-]

LLMs can research what a tool does before calling it though - they'll sniff that one out pretty quick.

I think the better route is to be honest and say that database integrity is a primary foundation of the company, there's no task worth pursuing that would require touching the database, specifically ask it to think hard before doing anything that gets close to the production data, etc.

I run a much lower-stakes version where an LLM has a key that can delete a valuable product database if it were so inclined. I've built a strong framework around how and when destructive edits can be made (they cannot), but specifically I say that any of these destructive commands (DROP, -rm, etc) need to be handed to the user to implement. Between that framework and claude code via CLI, it's very cautious about running anything that writes to the database, and the new claude plan permissions system is pretty aggressive about reviewing any proposed action, even if I've given it blanket permission otherwise.

I've tested it a few times by telling it to go ahead, "I give you permission", but it still gets stopped by the global claude safety/permissions layer in opus 4.7. IMO it's pretty robust.

Food for thought.

by not_kurt_godel7 hours ago|

parent|

[-]

> specifically ask it to think hard before doing anything that gets close to the production data

This is recklessly negligent and I would personally not tolerate a coworker or report doing it. What's next, sending long-lived access tokens out over email and asking pretty please for nobody to cc/forward?

by EagnaIonat6 hours ago|

parent|

prev|

[-]

> specifically ask it to think hard before doing anything that gets close to the production data, etc.

Standard rule is you never let your developers at the production instance. So I can't see why an LLM would get a break.

by Jean-Papoulos7 hours ago|

parent|

prev|

[-]

"I've put enough safety around the bomb that the bomb is worth using. The other people that exploded just didn't have enough safety but I do !"

by kamaal7 hours ago|

parent|

prev|

[-]

>>LLMs can research what a tool does before calling it though

Thats stretching the definition of 'research', it basically checks if the texts are close enough.

Delete can occur in various contexts, including safe contexts. It simply checks if a close enough match is available and executes. It doesn't know if what it is doing is safe.

Unfortunately a wide variety of such unsafe behaviours can show up. I'd even say for someone that does things without understanding them. Any write operation of any kind can be deemed unsafe.

by yowlingcat7 hours ago|

prev|

[-]

It's been a very strange realization to have with AI lately (which you have reminded me of) because it also reminds me that the same thing works with humans. Not the killing part at least, but the honeypot and jailing/restricting access part.

Probably because telling someone not to do something works the 99% of the time they weren't going to do it anyways. But telling somebody "here's how to do something" and seeing them have the judgment not do it gives you information right away, as does them actually taking the honeypot. At the heart of it, delayed catastrophic implosions are much worse than fast, guarded, recoverable failures. At the end of the day, I suppose that's been supposed part of lean startup methodology forever -- just always easy in theory and tricky in practice I suppose.