As long as it's giving the right outputs, who cares what's in latent space?
If the model thinks in latent space: "God I wish these people would die," and constantly does the right thing, who cares?
Additionally, if one of it's latent spaces that it never explores is a psychopath -> who cares? The path never gets taken...
That's a lot of harmless people walking around with crazy thoughts...
A lot of people are walking around with crazy thoughts. Some of them harm.
Outside of RLAIF, interpretability is the strongest way to do alignment right now. alignment is important because otherwise LLMs are incentivized to learn power seeking, dangerous behaviours [1]. a more downto earth example of alignment being important is that agents are incentivized to do tasks in the shortest way possible, and this way might not be what the user wants (I explain this further in another comment in this thread)
[1] https://www.forbes.com/sites/boazsobrado/2026/03/11/alibabas...
Those things being untrainable at scale is why they aren't around. Alignment is an afterthought.