The full data of what's in an LLM's "consciousness" is the conversation context. Just because it isn't hidden, doesn't necessarily mean it doesn't contain information you've overlooked.
Asking "why did you do that" won't reveal anything new, but it might surface some amount of relevant information (or it hallucinates, it depends which LLM you're using). "Analyse recent context and provide a reasonable hypothesis on what went wrong" might do a bit better. Just be aware that llm hypotheses can still be off quite a bit, and really need to be tested or confirmed in some manner. (preferably not by doing even more damage)
Just because you shouldn't anthropomorphize, doesn't mean an english capable LLM doesn't have a valid answer to an english string; it just means the answer might not be what you expected from a human.
No it's not, see research on hiddens states using SAE's and other methods. TBC, I agree with your second point, though I still believe top level OP was reckless and is now doing the businessman's version of throwing the dog under the bus.
A plausible document that follows the alignment that was done during the training process along with all of the other training where a LLM understanding its actions allows it to perform better on other tasks that it trained on for post training.
It sounds like "we know the LLM understood its actions... because it understood its actions when we trained it", which is circular-logic.
If you ask a human why they did something, the answer is a guess, just like it is for an LLM.
That's because obviously there is no relationship between the mechanisms that do something and the ones that produce an explanation (in both humans and LLMs).
An example of evidence from Wikipedia, "split brain" article:
The same effect occurs for visual pairs and reasoning. For example, a patient with split brain is shown a picture of a chicken foot and a snowy field in separate visual fields and asked to choose from a list of words the best association with the pictures. The patient would choose a chicken to associate with the chicken foot and a shovel to associate with the snow; however, when asked to reason why the patient chose the shovel, the response would relate to the chicken (e.g. "the shovel is for cleaning out the chicken coop").[4]
I can't prove it but this is almost certainly one of those things that is uh, less than universal in the population.
I'm aware of the condition, but let's not confuse failure modes with operational modes. A human with leg problems might use a wheelchair, but that doesn't mean you've cracked "human locomotion" by bolting two wheels onto something.
Also, while both brain-damaged humans and LLMs casually confabulate, I think there's some work to do before one can prove they use the same mechanics.