If they are co-trained only on activationWeights->readableText->activationWeights, without visibility into the actual stream of text that the probe-target LLM is processing, then it seems unlikely that the derived text can be both on-topic and unrelated to the "actual thoughts" in the activationWeights.
If the RL is brief and limited to a small subset of parameters, the AV will produce reasonable language, since it inherits that from the base LLM. It will also produce descriptions aligned with the input to the base LLM that produced the autoencoded activations, since the AR is still close to the base LLM (and could reconstruct the activations perfectly if fed the full context that produced them).
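A minimal sketch of that round-trip objective may help. Everything here is a hypothetical stand-in, not the actual setup: `Verbalizer` and `Reconstructor` are placeholders for the tuned copies of the base LLM, and real text generation is replaced by a Gumbel-softmax bottleneck so the sketch stays self-contained:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sketch of the activations -> text -> activations round trip.
# All names and dimensions are hypothetical.
D_ACT, VOCAB, SEQ = 512, 1000, 16   # activation dim, toy vocab size, text length

class Verbalizer(nn.Module):
    """Maps a residual-stream activation to a short discrete token sequence."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_ACT, SEQ * VOCAB)

    def forward(self, act):                          # act: (batch, D_ACT)
        logits = self.proj(act).view(-1, SEQ, VOCAB)
        # hard=True yields one-hot "tokens" with a straight-through gradient,
        # so the text stays discrete but the round trip stays trainable.
        return F.gumbel_softmax(logits, hard=True)   # (batch, SEQ, VOCAB)

class Reconstructor(nn.Module):
    """Maps the token sequence back to a predicted activation."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(VOCAB, 64)
        self.head = nn.Linear(SEQ * 64, D_ACT)

    def forward(self, tokens):                       # tokens: (batch, SEQ, VOCAB)
        return self.head(self.embed(tokens).flatten(1))

verbalize, reconstruct = Verbalizer(), Reconstructor()
opt = torch.optim.Adam([*verbalize.parameters(), *reconstruct.parameters()], lr=1e-3)

acts = torch.randn(32, D_ACT)        # stand-in for probe-target activations
loss = F.mse_loss(reconstruct(verbalize(acts)), acts)  # round-trip reconstruction
loss.backward()
opt.step()
```

The point above falls out of the bottleneck: the only path from input activation to reconstruction runs through the text, so the gradient pushes the text to carry the activation's content.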
I think the issue is that Goodhart's law rules out any permanent path to model understanding. Models are incentivized to appear aligned (well-trained) on any metric you apply to them, so if you develop a new metric and train on it, the model will learn a way to cheat on it.
The original model is frozen, so it learns nothing. The copies of the model are trained on different objectives and have no incentive to be "loyal" to the original.
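Concretely, in a setup like this only the copies receive gradients. A toy illustration, where the layer type and names are placeholders rather than the actual code:

```python
import copy
import torch.nn as nn

# Stand-in for the base LLM; any nn.Module would do here.
base_model = nn.TransformerEncoderLayer(d_model=512, nhead=8)
for p in base_model.parameters():
    p.requires_grad = False          # frozen: the original learns nothing

verbalizer = copy.deepcopy(base_model)
reconstructor = copy.deepcopy(base_model)
for copy_model in (verbalizer, reconstructor):
    for p in copy_model.parameters():
        p.requires_grad = True       # each copy trains on its own objective
```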
Maybe you're imagining they'll hook this up in some larger training loop, but they haven't done that yet.
E.g., could a misaligned model-in-training optimize toward a residual stream that naively reads the way these do, but in fact additionally encodes some more closely held beliefs?
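As a toy illustration of this worry (all components hypothetical, and not evidence that a real model would do this during training): in a high-dimensional residual stream there is ample slack to satisfy a frozen low-dimensional probe while encoding extra information elsewhere.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Optimize a single activation so a frozen probe reads it as innocent while a
# second frozen reader can still recover a hidden bit. Both linear maps are
# hypothetical stand-ins.
D_ACT = 512
probe = nn.Linear(D_ACT, 2)          # stand-in for the activation->text probe
covert_reader = nn.Linear(D_ACT, 1)  # stand-in for a circuit reading the payload
for p in [*probe.parameters(), *covert_reader.parameters()]:
    p.requires_grad = False

act = torch.randn(1, D_ACT, requires_grad=True)
opt = torch.optim.Adam([act], lr=1e-2)
innocent = torch.tensor([0])         # class the probe should report
payload = torch.tensor([[1.0]])      # hidden bit the covert channel should carry

for _ in range(200):
    opt.zero_grad()
    look_benign = F.cross_entropy(probe(act), innocent)
    keep_secret = F.binary_cross_entropy_with_logits(covert_reader(act), payload)
    (look_benign + keep_secret).backward()
    opt.step()
# Both losses drop to ~0: the probe's reading and the stream's full
# information content come apart.
```

This only shows the two objectives are jointly satisfiable in a toy setting, not that ordinary training pressure would find such a solution.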
It would be quite a coincidence if the training runs discovered an invertible weights->text->weights function that produces text that both "is on topic and intelligible as an inner monologue in context" and is unrelated to the meaning encoded in the activations.