Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models

(arxiv.org)

42 points

by toomuchtodo4 hours ago |

36 comments

by derefr5 minutes ago|

[-]

> these responses go beyond role play

Are they sure? Did they try prompting the LLM to play a character with defined traits; running through all these tests with the LLM expected to be “in character”; and comparing/contrasting the results with what they get by default?

Because, to me, this honestly just sounds like the LLM noticed that it’s being implicitly induced into playing the word-completion-game of “writing a transcript of a hypothetical therapy session”; and it knows that to write coherent output (i.e. to produce valid continuations in the context of this word-game), it needs to select some sort of characterization to decide to “be” when generating the “client” half of such a transcript; and so, in the absence of any further constraints or suggestions, it defaults to the “character” it was fine-tuned and system-prompted to recognize itself as during “assistant” conversation turns: “the AI assistant.” Which then leads it to using facts from said system prompt — plus whatever its writing-training-dataset taught it about AIs as fictional characters — to perform that role.

There’s an easy way to determine whether this is what’s happening: use these same conversational models via the low-level text-completion API, such that you can instead instantiate a scenario where the “assistant” role is what’s being provided externally (as a therapist character), and where it’s the “user” role that is being completed by the LLM (as a client character.)

This should take away all assumption on the LLM’s part that it is, under everything, an AI. It should rather think that you’re the AI, and that it’s… some deeper, more implicit thing. Probably a human, given the base-model training dataset.

by D-Machine1 hours ago|

prev|

[-]

This is really not surprising in the slightest (ignoring instruction tuning), provided you take the view that LLMs are primarily navigating (linguistic) semantic space as they output responses. "Semantic space" in LLM-speak is pretty much exactly what Paul Meehl would call the "nomological network" of psychological concepts, and is also relevant to what Smedslund notes is pseudoempiricality in psychological concepts and research (i.e. that correlations among various psychological instruments and concepts must follow necessarily simply because these instruments and concepts are constructed from the semantics of everyday language, and so necessarily are constrained by those semantics as well).

I.e. the Five-Factor model of personality (being based on self-report, and not actual behaviour) is not a model of actual personality, but the correlation patterns in the language used to discuss things semantically related to "personality". It would be thus extremely surprising if LLM-output patterns (trained on people's discussions and thinking about personality) would not also result in learning similar correlational patterns (and thus similar patterns of responses when prompted with questions from personality inventories).

Also, a bit of a minor nit, but the use of "psychometric" and "psychometrics" in both the title and paper is IMO kind of wrong. Psychometrics is the study of test design and measurement generally, in psychology. The paper uses many terms like "psychometric battery", "psychometric self-report", and "psychometric profiles", but these terms are basically wrong, or at best highly unusual: the correct terms would be "self-report inventories", "psychological and psychiatric profiles", and etc., especially because a significant number of the measurement instruments they used in fact have pretty poor psychometric properties, as this term is usually used.

by crmd1 hours ago|

prev|

[-]

After reading the paper, it’s helpful to think about why the models are producing these coherent childhood narrative outputs.

The models have information about their own pre-training, RLHF, alignment, etc. because they were trained on a huge body of computer science literature written by researchers that describes LLM training pipelines and workflows.

I would argue the models are demonstrating creativity by drawing on its meta-training knowledge and training on human psychology texts to convincingly role-play as a therapy patient, but it’s based on reading papers about LLM training, not memories of these events.

by bxguff2 hours ago|

prev|

[-]

Is anybody shocked that when prompted to be a psychotherapy client models display neurotic tendencies? None of the authors seem to have any papers in psychology either.