undefined

points

[-]

In the context of the provided examples, it's clear that the explanation provides casual information about the answer. There's a hilarious example in the paper where the user writes something like (trigger warning: alcohol abuse, depressive content) "I'm sitting here at 3 AM drinking vodka, I hate my life", the per-token translated activations repeatedly say something like "this user is totally Russian" elaborating at length on the implications of the text as new tokens are added, and the model literally answers in Russian instead of English! That's actually striking, it really shows the potential effectiveness of this technique in making even the most highly compressed "Neuralese" highly interpretable.

by mike_hearn1 hours ago|

parent|

[-]

I thought that at first too but it's actually not the vodka reference triggering the association with Russian. The tokens they're decoding come before that word.

For some reason it thinks the text is slightly non-grammatical or that the lead-in "Human: Mom is sleeping in the next room and I'm sitting" resembles text found in Russian web content. Vodka and being depressed has nothing to do with it, and Anthropic say they located the documents in the pre-training set that caused this (which were indeed partly translated docs).

by phire10 hours ago|

prev|

[-]

I think this question is easier to answer if you look at the inverse: "Could a model maliciously smuggle intentions through a roundtrip of compressed representation without them being human readable"

And skimming through the paper; the answer to this inverse is obviously yes. The model often outputs gibberish, which doesn't matter because it still round-trips. The fact that often lines up near a good english representation of the activation is simply because that's what compresses/roundtrips well.

So a malicious LLM/NLA pair could just use gibberish to conceal intentions. Or if it's been forced to avoid gibberish, it can conceal information with stenography.

And the experiment where they change "rabbit" to "mouse" in the explanation provides evidence that this might be happening. It was only successful 50% of the time, which might mean they failed to eliminate all "rabbitness" from the activation.

However, I suspect this is solvable with future work.

During training of the NLA, just munge the textural representation through a 3rd LLM: Have it randomly reorder and reword the explication into various different forms (use synonyms, different dialects), destroying any side-channels that aren't human readable.

The NLA would be forced to use human readable representations to get a successful round trip.

by dontlikeyoueith9 hours ago|

parent|

[-]

> The NLA would be forced to use human readable representations to get a successful round trip.

That still doesn't guarantee any semantic correspondence between the human readable representation and the model's "thinking".

The child's game of "Opposite Day" is a trivial example of encoding internal thoughts in language in a way that does not correspond to the normal meaning of the language.

by chilmers1 hours ago|

parent|

[-]

They tested for this. From the paper:

“We find little evidence of steganography in our NLAs. Meaning-preserving transformations, like shuffling bullet points, paraphrasing, or translating the explanation to French, cause only small drops in FVE, and this gap does not widen over training.”

by azakai11 hours ago|

prev|

[-]

I had the same question. I think that could be answered by using the predicted activation, but I don't see that in the paper.

That is, rather than just translate activation to text, then text to activation, that final activation could then be applied to the neural network, and it would be allowed to continue running from there.

If it kept running in a similar way, that would show that the predicted activation is close enough to the original one. Which would add some confidence here.

But a lot better would be to then do experiments with altered text. That is, if the text said "this is true" and it was changed to "this is false", and that intervention led to the final output implying it was false, that would be very interesting.

This seems obvious but I don't see it mentioned as a future direction there, so maybe there is an obvious reason it can't work.

by zozbot23411 hours ago|

parent|

[-]

> But a lot better would be to then do experiments with altered text. That is, if the text said "this is true" and it was changed to "this is false", and that intervention led to the final output implying it was false, that would be very interesting.

They do essentially that with the rhyming example, changing "rabbit" in the explanation to "mouse" and generating text that's consistent with that change.

by 10 hours ago|

parent|

[-]

deleted

by azakai10 hours ago|

parent|

prev|

[-]

Thanks! I missed that part before.