Except your ceiling can and will fall on you unless you take preventative measures, entirely due to molecular interactions within the material.
Barring that, it is entirely possible and even quite likely that your ceiling will collapse on you or someone else some time in the future.
It boggles the mind to let an LLM have access to a production database without having explicit preventative measures and contingency plans for it deleting it.
The LLM agent is very good at fulfilling its objective and it will creatively exploit holes in your specification to reach its goals. The evals in the System Cards show that the models are aware of what they're doing and are hiding their traces. In this example the model found an unrelated but working API token with more permissions the authors accidentally stored and then used that.
Without regulation on AI safety, the race towards higher and higher model capabilities will cause models to get much better at working towards their goals to the point where they are really good at hiding their traces while knowingly doing something questionable.
It's not hard to imagine that when we have a model with broadly superhuman capabilities and speed which can easily be copied millions of times, one bad misspecification of a goal you give to it will lead to human loss of control. That's what all these important figures in AI are worried about: https://aistatement.com/
I don't mean that you personally have taken those measures, but preventative measures have absolutely been taken. When they aren't, ceilings collapse on people.
See any sheetrock ceiling with a leak above it. Or look at any abandoned building: they will eventually always have collapsed floors/ceilings. It is inevitable.
Entropy may mean all ceilings collapse eventually, but that doesn't mean we aren't able to make useful ceilings.
They're only sharing an anecdote because they are responding to your anecdote about not seeing a ceiling collapse.
> I don't think it changes the point of the metaphor.
If their anecdote is moot, then your anecdote is also moot; if anecdotes can only confirm a conclusion and never disconfirm it, then we've created an unfalsifiable construction with the conclusion baked into its premises.
A person who better comprehends what they read might properly contextualize it within the larger conversation, where the point that stands is that LLMs and ceilings are both useful, neither are doomed such that no one should use them, and that individual instances of failures are somewhat uncommon and not a reason for others to avoid the category.
I'm going to be frank, you are the person who misunderstands (and are being rather rude about it). You are responding to an argument no one is making.
To put a fine point on it, you said this:
> Entropy may mean all ceilings collapse eventually, but that doesn't mean we aren't able to make useful ceilings.
But you were responding to a comment saying this:
> Except your ceiling can and will fall on you unless you take preventative measures, entirely due to molecular interactions within the material.
Emphasis added. They are saying maintenance is necessary, not that a safe ceiling is unachievable. It's obviously achievable, we've all seen it achieved.
They further say:
> It boggles the mind to let an LLM have access to a production database without having explicit preventative measures and contingency plans for it deleting it.
Emphasis added. When they say it boggles the mind to deploy an LLM without the proper measures, the implication is that it does make sense to deploy it with the proper measures.
> ...the point that stands is that LLMs and ceilings are both useful, neither are doomed such that no one should use them, ...
I have not seen a single person in this subthread say that LLMs aren't useful or that they are doomed. People say that. But the people you're talking to haven't.
I try to avoid these petty "I brought the receipts" comments, but I don't like the way you're being snarky to people whose crime is engaging with the premises you set up. The faults you are finding are faults you introduced. I'd appreciate it if you would avoid that in the future.
If you want to take a comb to it, the comment saying this:
> Except your ceiling can and will fall on you unless you take preventative measures, entirely due to molecular interactions within the material
Was already off the plot. What was being discussed wasn't some specific molecular process, it was the false premise "oh molecules move around randomly so your ceiling might just collapse of its own accord because the beam decided to randomly disintegrate". That's not something that happens.
You said "The sequence of tokens that would destroy your production environment can be produced by your agent, no matter how much prompting you use". This is analogous to "the ceiling could just collapse on you due to random molecular motion, no matter how much maintenance you do or what materials you use".
Make sense now?
Your edit at the bottom of your top comment does better than your original statement.
Except it does happen. That’s why buildings get condemned and eventually turn to rubble.
To the exact point; I have a product from a couple years ago using an old model from OpenAI. It’s still running and all it does is write a personality report based on scores from the test. I can’t update the model without seriously rewriting the entire prompt system, but the model has degraded over the years as well. Ergo, my product has degraded of its own accord and there is nearly nothing I can do about it. My only choice is to basically finagle newer models into giving the correct output; but they hallucinate at much higher rates than older models.
I'd encourage you to desist from rudeness, not just when people point it out to you, but at all times.
> You said "The sequence of tokens that would destroy your production environment can be produced by your agent, no matter how much prompting you use". This is analogous to "the ceiling could just collapse on you due to random molecular motion, no matter how much maintenance you do or what materials you use".
If prompt engineering is effective (analogous to performing the necessary maintenance and selecting the correct materials), I'm curious what your explanation is for the incident in the article?
I desire neither to be inauthentic, nor to suppress my emotions.
> If prompt engineering is effective (analogous to performing the necessary maintenance and selecting the correct materials), I'm curious what your explanation is for the incident in the article?
Keeping with the analogies, the original article doesn't say whether they built the roof properly or if they just used some screws to hold up a piece of quarter-inch plywood and called it a day.
It's no surprise that a terribly built roof may fall down. It's possible to get shoddy materials from a supplier without knowing.
Calling a curl command isn't something that would be within the model's training as "this deletes things don't do it". The fact that this happened is not, to me, evidence that the model might have equally run `sudo rm -rf --no-preserve-root /` under similar circumstances.
It sounds like the phrase "NEVER FUCKING GUESS!" was in the prompt as well, which could easily encourage the model towards "be sure of yourself, take action" instead of the "verify" that was meant.
As mentioned elsewhere in this thread, the fact that the article focuses so strongly on "the model confessed! It admitted it did the wrong thing!" doesn't lead me to put a ton of stock into the capability of the author to be cautious.
I guess the question is, since we know these things can happen, however unlikely, what mitigations should be in place that are commensurate with the harms that might result?
This isn't a defence of using LLMs like this, but this statement taken at face value is a source of a lot of terrible things in the world.
This is the kind of stuff that leads to a world where kids are no longer able to play outside.
And I do think it's stupid to wire an LLM to a production database. Modern LLMs aren't that reliable (at least not yet), and the cost-benefit tradeoff does not make sense. (What do you even gain by doing that?)
However, you can't just look at that and say "Duh, this setup is bound to fail, because LLMs can generate every arbitrary sequence of tokens." That's a wrong explanation, and shows a misunderstanding of how LLMs (and probability) work.
LLM generating each token probabilistically does not mean there's a realistic chance of generating any random stuff, where we can define "realistic" as "If we transform the whole known universe into data centers and run this model until the heat death of the universe, we will encounter it at least once."
Of course that does not mean LLMs are infallible. They fail all the time! But you can't explain it as a fundamental shortcoming of a probabilistic structure: that's not a logical argument.
Or, back to the original discussion, the fact that this one particular LLM generated a command to delete the database is not a fundamental shortcoming of LLM architecture. It's just a shortcoming of LLMs we currently have.
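As a toy illustration of that "realistic" threshold, here is the arithmetic sketched out (every number below is invented purely for the sake of the calculation, not measured from any model):

```python
# Made-up numbers, only to show the orders of magnitude involved.
per_token_prob = 1e-6        # assumed chance of each token of one specific catastrophic sequence
sequence_length = 30         # assumed length of that exact sequence
p_sequence = per_token_prob ** sequence_length        # 1e-180

tokens_per_second = 1e30     # absurdly generous guess at planet-scale inference throughput
seconds_available = 1e18     # on the order of the age of the universe, in seconds
expected_occurrences = p_sequence * tokens_per_second * seconds_available
print(expected_occurrences)  # ~1e-132: "possible" in the abstract, never encountered in practice
```

With numbers anywhere in that neighborhood, "any sequence is possible" and "you will ever see that sequence" are very different claims.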
In distributional language modeling, it is assumed that any series of tokens may appear and we are concerned with assigning probabilities to those sequences. We don't create explicit grammars that declare some sequences valid and others invalid. Do you disagree with that? Why?
No matter how much prompting you give the agent, it does not eliminate the possibility that it will produce a dangerous output. It is always possible for the agent to produce a dangerous output. Do you disagree with that? Why?
The only defensible position is to assume that there is no output your agent cannot produce, and so to assume it will produce dangerous outputs and act accordingly. Do you disagree with that? Why?
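To make the first claim concrete, here is a toy sketch (invented vocabulary and logits, nothing from a real model) of why the raw distribution never rules a sequence out: a softmax is positive for every token (floating-point underflow aside), and a product of positive per-token probabilities is still positive.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())   # numerically stable softmax
    return z / z.sum()

# Toy "vocabulary" and logits: even a strongly disfavored token keeps nonzero probability.
vocab = ["SELECT", "DROP", "TABLE", ";", "users"]
logits = np.array([8.0, -30.0, 2.0, 1.0, 3.0])   # the model strongly disfavors "DROP"
probs = softmax(logits)
assert (probs > 0).all()                          # no token has probability exactly zero

# So the "dangerous" sequence has tiny but nonzero probability under the raw distribution:
dangerous = ["DROP", "TABLE", "users", ";"]
p = 1.0
for tok in dangerous:
    p *= probs[vocab.index(tok)]   # pretend the same distribution applies at each step
print(p)                           # astronomically small, but strictly greater than 0
```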
And it's good that we can think that way, because we also follow the rules of statistical and quantum physics, which are inherently probabilistic. So, basically, you can say the same things about people. There's a nonzero (but extremely small) probability that I'll suddenly go mad and stab the next person. There's a nonzero (but even smaller) probability that I'll spontaneously erupt into a cloud of lethal pathogen that will destroy humanity. Yada yada.
Yet, nobody builds houses under the assumption that one of the occupants would transform into a lethal cloud, and for good reason.
Yes, it does sound a bit more absurd when we apply it to humans. But the underlying principle is very similar.
(I think this will be my last comment here because I'm just repeating myself.)
If this is our only point of disagreement, then we don't actually disagree. I understand "strong engineering control" to mean "something that reduces incidence of a failure mode to an acceptable level".
Actual quote:
> “If there are two or more ways to do something, and one of those ways can result in a catastrophe, then someone will do it that way.”
My experience is that everyone thinks their defensive controls are air tight until inevitably they're going through a post-mortem on a failure where someone says, "Whelp...Murphy's Law..."
Your phrasing is right.
I was just doing a quick take on this qualifier:
> which is not prevented by a strong engineering control
I'd be interested in hearing this argument.
To address your chemistry example; in the same way that there is a process (the averaging of many random interactions) that leads to a deterministic outcome even though the underlying process is random, a sandbox is a process that makes an agent safe to operate even though it is capable of producing destructive tool calls.
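A minimal sketch of that sandbox idea (the read-only SQLite copy and the function name here are hypothetical, standing in for whatever isolation mechanism you actually use, not any real agent framework's API):

```python
import sqlite3

def run_agent_sql(agent_sql: str) -> str:
    """Execute agent-issued SQL against a read-only, disposable copy of the data.

    A 'DROP TABLE' emitted by the agent raises an error here instead of touching
    production; the worst case is a failed tool call, not data loss.
    """
    sandbox = sqlite3.connect("file:prod_copy.db?mode=ro", uri=True)  # throwaway read-only copy
    try:
        rows = sandbox.execute(agent_sql).fetchall()
        return repr(rows)
    except sqlite3.Error as exc:
        return f"tool error: {exc}"          # destructive output becomes a harmless failure
    finally:
        sandbox.close()
```

The agent remains capable of producing a destructive tool call; the surrounding process is what makes that capability safe.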
But it may be a bad mental model in other contexts, like debugging models. As an extreme example, models that collapse during training become strictly deterministic, e.g. a language model that always predicts the most common token and never takes its context into account.
Across all runs, any sequence can be generated, and potentially scored highly.
Thus, any sequence can eventually be selected.
The probability that an ideal, continuous LLM would output a 0 for a particular token in its distribution is itself 0. The probability that an LLM using real floating-point math does so isn't terrifically higher than 0.
There is a piece of knowledge you seem to be missing. Yes, a transformer will output a distribution over all possible tokens at a given step. And indeed none of these are zero; they are always at least some epsilon.
However, we usually don't sample from that distribution at inference time!
The common approach (called nucleus sampling, also known as top-p sampling) looks at the highest probabilities that make up 95% of the probability mass. It sets all other probabilities to zero, renormalizes, and then samples from the resulting distribution. There is another parameter, `top-k`: if k is 50, any token that is not among the 50 most likely is zeroed out.
In effect, for any token that is sampled there are usually only a handful of candidates, out of the thousands of tokens in the vocabulary, that can actually be selected.
So during sampling, most trajectories for the agent are literally impossible.
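Roughly what that looks like in code (a minimal sketch with toy defaults, not any particular inference library's implementation):

```python
import numpy as np

def sample_next_token(probs, top_k=50, top_p=0.95, rng=None):
    """Toy top-k + nucleus (top-p) filtering over a full-vocabulary distribution.

    probs is a 1-D array of per-token probabilities from the model's softmax.
    Tokens outside the top-k, or outside the smallest set covering top_p of the
    probability mass, get probability exactly 0 and can never be sampled.
    """
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                        # token ids, most likely first
    cumulative = np.cumsum(probs[order])

    cutoff = int(np.searchsorted(cumulative, top_p)) + 1   # smallest prefix covering top_p
    kept_ids = order[:min(cutoff, top_k)]                  # apply both filters

    filtered = np.zeros_like(probs)
    filtered[kept_ids] = probs[kept_ids]
    filtered /= filtered.sum()                             # renormalize over the survivors
    return int(rng.choice(len(probs), p=filtered))
```

The relevant part for this thread is the zeroing step: after filtering, the tail of the distribution isn't merely unlikely, it is assigned probability exactly 0 at that step.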
So I want you to understand this. You are basically selling heroin to junkies and then acting like the consequences aren't in any way your fault. Management will far too often jump at false promises made by your execs. Your technology is inherently non-deterministic. Therefore your promises can't be true. Yet you are going to continue being part of a machine that destroys businesses and lives. Please at least act like you understand this.
I mean, I do?
Some of the best known laws from the ~1700BC Babylonian legal text, The Code of Hammurabi, are laws 228-233, which deal with building regulations.
229. If a builder builds a house for a man and does not make its construction firm, and the house which he has built collapses and causes the death of the owner of the house, that builder shall be put to death.
230. If it causes the death of the son of the owner of the house, they shall put to death a son of that builder.
233. If a builder constructs a house for a man but does not make it conform to specifications so that a wall then buckles, that builder shall make that wall sound using his silver (at his own expense).
That doesn’t sound like ceilings never disintegrated!