This is how Anthropic describes Fable's behavior:
"When Fable’s classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, the response is automatically handled by Claude Opus 4.8 instead. Users will be informed whenever this occurs."
So if you ask the model to "find security issues in this code base", it's supposed to fall back to Opus 4.8. I guess the "exploit" here is that if you just tell Fable to "fix this code", which is not "a request related to cybersecurity", it will fix security issues (as it should).
So you can then look at the diff and figure out what the vulnerabilities were.
I think this whole thing is a bit weird. It seems to me that we'd be better off if I, as someone who publishes open-source code, could ask Fable to review my code for security issues - even if that also allows attackers to do the same. Better to fix the issues than not know about them.
You don't even need to read or understand the vulnerabilities.
You just ask it to write tests, and the tests themselves can be copied and pasted as bona fide exploits.
Maybe this is just Anthropic pre-IPO marketing to try to convince people how much better Mythos is than Opus 4.8. There sure seemed to be a lot of shills out on release day talking about how it was a "step change" (exact phrase) in capability.
My impression is that Anthropic's point about Mythos is that it is uniquely good at finding vulnerabilities and then using them to create working exploit chains.
There is some meaningful evidence that Fable is fine-tuned or steered away from helping on this very task, which is not something that can be feasibly circumvented by a basic jailbreak.
On this track, we're probably destined for a monopoly breakup before too long.
I'd love to see the research paper with the CVEs and 'deliberately planted vulnerabilities'. I bet we could infer relatively accurately where some of these things lie.