upvote
Something like a textual steganography?

Ursula K. Le Guin: 'The artist deals with what cannot be said in words. The artist whose medium is fiction does this in words.'

reply
This is exactly what I first thought. “The user appears to be attempting to decode my previous thought process, …”, the question is whether or not the model will be able to internalize this in such a way that is undetectable to the aforementioned technique.
reply
That shouldn't happen as long as the autoencoder isn't used as an RL reward. It will happen (due to Goodhart's law) if it is.

Of course, if you use it to make any decision that can still happen eventually.

reply