But it can still be useful, as long as you interpret it as "which stimuli most likely triggered the behaviour?" You can't trust it uncritically, but models do sometimes pinpoint useful things about how they were prompted.
The real meaning of accountability is that you can fire one if you don't like how they work. Good news! You can fire an AI too.
It's similarly reasonable to drop a tool that's unreliable, though I don't think that's a reasonable description here. Instead, they used a tool which is generally known to be unpredictable and failed to sandbox it adequately.
The cold hard fact is: LLMs are an unreliable tool, and using them without checking their every action is extremely foolish.
You mean checking every action of theirs outside the sandbox, I suppose? Otherwise I would consider any attempt at letting an agent do some work foolish.
At least for now.
No, we are not born with all the pre-training we need. That is rather the point of education, teaching people's brains how to process information in new, maybe unintuitive ways.
And in the reverse: if a person makes a series of impulsive, damaging decisions, they probably will not be able to accurately explain why, because neither the brain nor its physiology is tuned to permit it.
Seems pretty much the same to me.
What do you mean by fire? And how is the accountability similar to an employee?
There is no internal monologue with which to have introspection (beyond what the AI companies choose to hide as a matter of UX or what have you). There is no "I was feeling upset when I said/did that" unless it's in the context.
There is no ghost in the machine that we cannot see before asking.
Even if a model is able to come up with a narrative, it's simply that. Looking at the log and telling you a story.
Maybe. How do you tell? What would you expect to be different if they didn't?
> The LLM literally cannot possibly have a deeper insight into the root cause than the user, because it can only work from the information that the user has access to.
Insight is not solely a function of available input information. Arguably being able to search and extract the relevant parts is a far more important part of having insights.
I think you're asking how I would know if other people were P-zombies. That's an inappropriate question because I didn't talk about subjective experience, just about internal state. There's no question about whether other people have internal states. I can show someone a piece of information in such a way that only they see it and then ask them to prove that they know it such that I can be certain to an arbitrarily high degree that their report is correct.
Unvoiced thoughts are trickier to prove, but quite often they leave their mark in the person's voiced thoughts.
>Insight is not solely a function of available input information. Arguably being able to search and extract the relevant parts is a far more important part of having insights.
LLMs are notoriously bad at judging relevance. I've noticed that, quite often, if you ask a somewhat vague question they try to cold-read you, throwing out various guesses to see which one you latch onto. They're very bad at interpreting novel metaphors, for example.
In fact, talking about "thinking" at all is already the wrong direction to go down when trying to triage an incident like this. "Do not anthropomorphize the lawnmower" applies to AI as much as Larry Ellison.
If thinking is the wrong direction to go down, then it is also the wrong direction to go down when talking about humans.
Sometimes I think we're too eager to compare ourselves to them.
But are their explanations for how they behaved any more compelling than those of people who have? If so, why?
LLMs are lacking layers of awareness that humans have. I wonder if achieving comparable awareness in LLMs would require significantly more compute, and/or would significantly slow them down.
I argue that the model has no access to its thoughts at the time.
Split brain experiments notwithstanding I believe that I can remember what my faulty assumptions were when I did something.
If you ask a model “why did you do that” it is literally not the same “brain instance” anymore and it can only create reasons retroactively based on whatever context it recorded (chain of thought for example).
You got the wrong takeaway from your link.
This is falsified by that study, which shows that in frontier models generalized introspection does exist. It isn't consistent, but it is provable.
"no access" vs. "limited access"
You cannot trust that the model has introspection so for all intents and purposes for the end user it doesn't.
I suspect you’re making assumptions that don’t hold up to scrutiny.
You appear to be defaulting to the assumption that LLMs and humans have comparable thought processes. I don't think it's on me to provide evidence to the contrary but rather on you to provide evidence for such a seemingly extraordinary position.
For an example of a difference, consider that inserting arbitrary placeholder tokens into the output stream improves the quality of the final result. I don't know about you but if I simply repeat "banana banana banana" to myself my output quality doesn't magically increase.
It is known that the narrative part of the brain is separate from the decision-making part. If someone asks you, in a very convincing, persuasive way, why you did something a year ago that you can't clearly remember doing, you can become positive that you did it anyway. And then the mind just hallucinates a reason. That's a trait of brains.
Yes, brains can hallucinate reasons; that doesn't mean they always do. If all reasons given were hallucinations then introspection would be impossible, but clearly introspection does help people.
There is no misinformation in what I wrote.
On top of that, the agent is just doing what the LLM says to do, yet Opus is not brought up except as a parenthetical in this post. Sure, Cursor markets safety when they can't provide it, but the model was the one that issued the tool call. If people like this think that their data will be safe if they just use the right agent with access to the same things, they're in for a rude awakening.
From the article, apparently an instruction:
> "NEVER FUCKING GUESS!"
Guessing is literally the entire point: guess tokens in sequence, and something resembling coherent thought comes out.
The “agent’s confession” is the least interesting and useful part of the whole saga. Nothing there helps to explain why the disaster happened or what kind of prompting might help avoid it.
The key mistake is accidentally giving the agent the API key, and the key letdown is the lack of capability scoping or backups in the service.
The main lessons I take are “don’t give LLMs the keys to prod” and “keep backups”. Oh, and “even if you think your setup is safe, double-check it!”
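The "capability scoping" lesson can be mechanized in a few lines. A minimal sketch (the env-var name, allowlist default, and helper are all hypothetical, not any real provider's API): the agent's tool layer refuses destructive calls against any environment not explicitly allowlisted.

```python
import os

def guard_destructive(action: str, target_env: str) -> None:
    """Refuse a destructive action unless the target environment is
    explicitly allowlisted. AGENT_ALLOWED_ENVS and its default are
    illustrative choices, not a real convention."""
    allowed = set(os.environ.get("AGENT_ALLOWED_ENVS", "dev,staging").split(","))
    if target_env not in allowed:
        raise PermissionError(
            f"refusing {action!r} in {target_env!r}: not in {sorted(allowed)}"
        )

# A tool wrapper would call this before any delete:
guard_destructive("delete_volume", "staging")   # permitted, returns None
try:
    guard_destructive("delete_volume", "prod")  # blocked
except PermissionError as e:
    print(e)
```

The point is that the check lives outside the model: no amount of prompting can talk it past the allowlist.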
The post-hoc reasoning the model produces when you ask "why did you do that" is also just text, and yet that text often matches independent third-party analysis of the same behavior at well above chance. If it really were uncorrelated text-completion, the post-hoc explanation should not align with the actual causes more than randomly. It does, frequently enough that I've stopped using it as evidence the user is naive.
"just outputs text" is doing more work than we acknowledge. The person asking the agent "why did you do that" might be an idiot for expecting anything more than a post-hoc rationalization, but that's exactly what you'd expect from a human too.
It feels like a modern Greek tragedy. Man discovers LLMs are untrustworthy, then immediately uses an LLM as his mouthpiece.
Delicious!
Which calls into question if this is even real.
If you can do this and reliably reduce the rate at which it does bad things, then you could reasonably claim that it is capable of meaningful introspection.
> No confirmation step. No "type DELETE to confirm." No "this volume contains production data, are you sure?" No environment scoping. Nothing.
> The agent that made this call was Cursor running Anthropic's Claude Opus 4.6 — the flagship model. The most capable model in the industry. The most expensive tier. Not Composer, not Cursor's small/fast variant, not a cost-optimized auto-routed model. The flagship.
The tropes, the tropes!!
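The missing "type DELETE to confirm" step from the quote above is cheap to implement. A hedged sketch (the helper and its rules are hypothetical, not the service's actual API) in which production resources additionally require the resource name to be retyped:

```python
def confirm_destructive(resource: str, is_production: bool, typed: str) -> bool:
    """Gate a destructive call behind an explicit typed confirmation.
    Illustrative policy: production deletes must retype the resource name."""
    required = f"DELETE {resource}" if is_production else "DELETE"
    return typed.strip() == required

# Production resources demand the resource name be retyped:
assert confirm_destructive("prod-db-volume", True, "DELETE prod-db-volume")
assert not confirm_destructive("prod-db-volume", True, "DELETE")
# Non-production only needs the keyword:
assert confirm_destructive("dev-scratch", False, "DELETE")
```

For an agent, the confirmation string would have to come from a human channel, not from the model itself, or the model will happily type it.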
I don't think there's any special introspection that can be done, even in a mechanical sense, is there? That is to say, asking any other model or a human to read what was done and explain why would give you an accounting that is just as fictional.
We can debate philosophy and theory of mind (I’d rather not), but any reasonable coding agent totally DOES consider what it’s going to do before acting. Reasoning. Chain of thought. You can hide behind “it’s just autoregressively predicting the next token, not thinking” and pretend none of the intuition we have for human behavior applies to LLMs, but it’s self-limiting to do so. Many, many of their behaviors mimic human behavior, and the same mechanisms for controlling this kind of decision making apply to both humans and AI.
When a human asks another human “why did you do X?”, the other human can of course attempt to recall the literal thoughts they had while they did X (which I would agree with you are quite analogous to the LLMs chain of thought).
But they can do something beyond that, which is to reason about why they may have the beliefs that they had.
“Why did you run that command?”
“Because I thought that the API key did not have access to the production system.”
When a human responds with this they are introspecting their own mind and trying to project into words the difference in understanding they had before and after.
Whereas for an agent it will happily include details that are not literally in its chain of thought as justifications for its decisions.
In this case, I would argue that it’s not actually doing the same thing humans do, it is creating a new plausible reason why an agent might do the thing that it itself did, but it no longer has access to its own internal “thought state” beyond what was recorded in the chain of thought.
Humans do this too, ALL THE TIME. We rationalize decisions after we make them, and truly believe that is why we made the decision. We do it for all sorts of reasons, from protecting our ego to simply needing to fill in gaps in our memory.
Honestly, I feel like asking an AI for its train of thought behind a decision is slightly more useful than asking a human (although not much more useful), since an LLM has a better ability to recreate a decision process than a human does (an LLM can choose to perfectly forget new information to recreate a previous decision).
Of course, I don’t think it is super useful for either humans or LLMs. Trying to get the human OR LLM to simply “think better next time” isn’t going to work. You need actual process changes.
This was a rule we always had at my company for any after incident learning reviews: Plan for a world where we are just as stupid tomorrow as we are today. In other words, the action item can’t be “be more careful next time”, because humans forget sometimes (just like LLMs). You will THINK you are being careful, but a detail slips your mind, or you misremember what situation you are in, or you didn’t realize the outside situation changed (e.g. you don’t realize you bumped the keyboard and now you are typing in another console window).
Instead, the safety improvements have to be about guardrails you put up, or mitigations you put in place to prevent disaster the NEXT time you fail to be as careful as you are trying to be.
Because there is always a next time.
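The "just as stupid tomorrow" principle can be made concrete for agents: instead of asking the model to be more careful, interpose a vetting step between the model's proposed shell command and execution. A minimal sketch (the denylist and helper name are illustrative, deliberately not exhaustive):

```python
import shlex

# Illustrative denylist of destructive verbs; a real guardrail would be
# an allowlist plus sandboxing, not this.
DESTRUCTIVE_VERBS = {"rm", "drop", "truncate", "shred", "mkfs"}

def vet_command(cmd: str) -> bool:
    """Return True if the command may auto-run; anything containing a
    destructive verb gets routed to a human instead."""
    return not any(tok in DESTRUCTIVE_VERBS for tok in shlex.split(cmd.lower()))

assert vet_command("ls -la /var/log")
assert not vet_command("rm -rf /var/lib/data")
```

The guardrail fails safe regardless of how careful the model (or operator) is feeling that day, which is exactly the property the rule above demands.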
Honestly, I think the biggest struggle we are having with LLMs is not knowing when to treat them like a normal computer program and when to treat them like a more human-like intelligence. We run across both issues all the time: we expect them to behave like a human when they don’t, and then turn around and expect them to behave like a normal computer program when they don’t.
This is BRAND NEW territory, and we are going to make so many mistakes while we try to figure it out. We have to expect that if you want to use LLMs for useful things.
That’s a great way of putting it, I’ll remember that one (except when I forget...)
However, it cannot do so after the fact. If there's a reasoning trace, it could extract a justification from it. But if there isn't, or if the reasoning trace makes no sense, then the LLM will just lie and make up reasons that sound about right.
I think the same thing, but about agents in general. I am not saying that we humans are automata, but most of the time explanation diverges profoundly from motivation, since motivation is what generated our actions, while explanation is the process of observing our actions and giving ourselves, and others around us, plausible mechanics for what generated them.