https://alignment.openai.com/argo/ (finding what the reward models are actually encouraging)
https://alignment.openai.com/sae-latent-attribution/ (what model features drive specific behaviours; presumably this would be great for goblin hunts)
https://alignment.openai.com/helpful-assistant-features/ (how a high-level misaligned personality shows up when fine-tuning on bad advice)
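For anyone who hasn't clicked through, here's a minimal sketch of what SAE latent attribution roughly amounts to, assuming a trained sparse autoencoder over residual-stream activations and activation-times-gradient attribution. The names and signature below are illustrative, not whatever the linked tool actually exposes:

```python
import torch

# Illustrative sketch only (not the linked tool's API). Assumes a trained SAE:
#   latents = relu(acts @ W_enc + b_enc),  reconstruction = latents @ W_dec + b_dec
# Each latent's contribution to a behaviour is approximated by activation x gradient.

def sae_latent_attribution(acts, W_enc, b_enc, W_dec, b_dec, score_fn, top_k=10):
    # Encode activations into SAE latents; treat the latents as the variables we attribute to.
    latents = torch.relu(acts @ W_enc + b_enc).detach().requires_grad_(True)
    recon = latents @ W_dec + b_dec    # decode back into model space
    score = score_fn(recon)            # scalar behaviour score, e.g. the logit of a problem token
    score.backward()
    # Activation x gradient per latent, summed over the batch; big values = candidate features.
    attribution = (latents * latents.grad).sum(dim=0)
    return torch.topk(attribution, k=top_k)
```

Run that with a score_fn that picks out the logit of whatever token/behaviour you care about, and the top latents are the candidate features to go stare at.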
It's weird that the goblin post doesn't seem to draw upon these tools.
Anthropic's recent emotions paper shows how broad these functional emotions are, even finding specific emotions firing right before the model cheats (!): https://transformer-circuits.pub/2026/emotions/index.html
I hope their alignment researchers aren't too annoyed by the Goblin post; it seems oddly siloed!
Goes to show it's all vibes when making these models. The fix is literally a prompt that says not to talk about goblins...
> We retired the “Nerdy” personality in March after launching GPT‑5.4. In training, we removed the goblin-affine reward signal and filtered training data containing creature-words, making goblins less likely to over-appear or show up in inappropriate contexts. Unfortunately, GPT‑5.5 started training before we found the root cause of the goblins.
The prompt is just a short-term hotfix/hack because they couldn't land the proper fix in time.
If you need to put baby guardrails on your model because the training is effed up, maybe you should rethink how you make these models and how much control you really have over them.
I propose "Goblin Hunter"
(if goblins ever turn out to be an actual species, I apologize for this pre-bigotry)