That still doesn't guarantee any semantic correspondence between the human readable representation and the model's "thinking".
The child's game of "Opposite Day" is a trivial example of encoding internal thoughts in language in a way that does not correspond to the normal meaning of the language.
“We find little evidence of steganography in our NLAs. Meaning-preserving transformations, like shuffling bullet points, paraphrasing, or translating the explanation to French, cause only small drops in FVE, and this gap does not widen over training.”