It seems like they're doing RL to minimize the reconstruction error in the loop: activation -> encoder -> "verbal" description of the activation -> decoder -> reconstructed activation. Depending on how aggressively they optimize the weights of the AV and AR, these could drift well away from the initial base LLM and learn an arbitrary encoding scheme.
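A minimal sketch of that reconstruction objective, with toy stand-ins for the two models (the function names `verbalize` and `reconstruct` are mine, not from the paper; in the real setup both would be LLMs and the reward would drive the RL updates):

```python
import numpy as np

def verbalize(activation):
    # Stand-in for the AV: map an activation vector to a text
    # description -- a lossy textual bottleneck.
    return " ".join(f"{x:.2f}" for x in activation)

def reconstruct(description):
    # Stand-in for the AR: map the description back to a vector.
    return np.array([float(tok) for tok in description.split()])

def reconstruction_reward(activation):
    # Negative MSE between the original and round-tripped activation.
    # RL on the AV/AR weights would push this toward zero -- with no
    # pressure for the intermediate text to stay human-readable.
    desc = verbalize(activation)
    recon = reconstruct(desc)
    return -float(np.mean((activation - recon) ** 2))

act = np.array([0.123, -0.456, 0.789])
print(reconstruction_reward(act))  # near zero: only rounding loss
```

The point of the sketch: nothing in the reward itself constrains `verbalize` to emit honest natural language, so only proximity to the base LLM keeps the descriptions interpretable.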

If the RL is brief and limited to a small subset of parameters, the AV will produce reasonable language, since it inherits that from the base LLM. It will also produce descriptions aligned with the input that produced the autoencoded activations, since the AR is still close to the base LLM (and could reconstruct the activations perfectly if fed the full context that produced them).
