Yeah, I'm confused as well. Why would the models hold any memory of red-teaming attempts, or of how the training was conducted?

I'm really curious what the point of this paper is.

Gemini is very paranoid in its reasoning chain, that I can say for sure. That's a direct consequence of the nature of its training. However, the reasoning chain is not entirely in human language.

None of the studies of this kind are valid unless backed by mechinterp, and even then interpreting transformer hidden states as human emotions is pretty dubious as there's no objective reference point. Labeling this state as that emotion doesn't mean the shoggoth really feels that way. It's just too alien and incompatible with our state, even with a huge smiley face on top.

I'm genuinely ignorant of how those red-teaming attempts are incorporated into training, but I'd guess that this kind of dialogue is fed in as something like normal training data? Which is interesting to think about: it might not even be red-team dialogue from the model under training, but it could still be useful as an example or counter-example of what abusive attempts look like and how to handle them.

Are we sure there isn't some company out there crazy enough to feed all its incoming prompts back into model training later?