https://alignment.openai.com/argo/ (finding what the reward models are actually encouraging)
https://alignment.openai.com/sae-latent-attribution/ (what model features drive specific behaviours; presumably this would be great for goblin hunts)
https://alignment.openai.com/helpful-assistant-features/ (how a high-level misaligned personality shows up when fine-tuning on bad advice)
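For anyone who hasn't clicked through, here's a minimal sketch of what SAE latent attribution roughly amounts to, assuming a trained sparse autoencoder over residual-stream activations and activation-times-gradient attribution. The names and signature below are illustrative, not whatever the linked tool actually exposes:

```python
import torch

# Illustrative sketch only (not the linked tool's API). Assumes a trained SAE:
#   latents = relu(acts @ W_enc + b_enc),  reconstruction = latents @ W_dec + b_dec
# Each latent's contribution to a behaviour is approximated by activation x gradient.

def sae_latent_attribution(acts, W_enc, b_enc, W_dec, b_dec, score_fn, top_k=10):
    # Encode activations into SAE latents; treat the latents as the variables we attribute to.
    latents = torch.relu(acts @ W_enc + b_enc).detach().requires_grad_(True)
    recon = latents @ W_dec + b_dec    # decode back into model space
    score = score_fn(recon)            # scalar behaviour score, e.g. the logit of a problem token
    score.backward()
    # Activation x gradient per latent, summed over the batch; big values = candidate features.
    attribution = (latents * latents.grad).sum(dim=0)
    return torch.topk(attribution, k=top_k)
```

Run that with a score_fn that picks out the logit of whatever token/behaviour you care about, and the top latents are the candidate features to go stare at.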
It's weird that the goblin post doesn't seem to draw upon these tools.
Anthropic's recent emotions paper shows how broad these functional emotions are, even finding specific emotions firing right before the model cheats (!): https://transformer-circuits.pub/2026/emotions/index.html
I hope their alignment researchers aren't too annoyed by the Goblin post; it seems oddly siloed!
Goes to show it's all vibes when making these models. The fix is literally a prompt that says not to talk about goblins...
> We retired the “Nerdy” personality in March after launching GPT‑5.4. In training, we removed the goblin-affine reward signal and filtered training data containing creature-words, making goblins less likely to over-appear or show up in inappropriate contexts. Unfortunately, GPT‑5.5 started training before we found the root cause of the goblins.
The prompt is just a short-term hotfix/hack because they couldn't land the proper fix in time.
If you need to put baby guardrails on your model because the training is effed up, maybe you should rethink how you make these models and how much control you really have over them.
I propose "Goblin Hunter"
(if goblins ever turn out to be an actual species, I apologize for this pre-bigotry)