upvote
You can see their general approach to guardrail classifiers in these posts:

https://www.anthropic.com/research/constitutional-classifier... https://www.anthropic.com/research/next-generation-constitut...

It's not just keyword matching, but I'm sure they tuned the Fable classifiers pretty hard to avoid false negatives.

reply
But you think that Anthropic of all companies would realize this, so why did they do it that way? Did they literally take the first suggestion Mythos gave them to add these guardrails - wouldn't be surprising, seeing the state of the leaked Claude Code codebase.
reply