> unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).
https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...
(stolen from https://jonready.com/blog/posts/claude-fable5-is-allowed-to-...)
Collectively, they are known as GREEDI-BULLSHIT.
Given how sensitive the classifiers are, any data scientist or MLE is probably going to hit cases where their work is silently degraded and they never find out.
They are trying to widen their 6-18 month lead over China-based models. Could that gap grow to, say, 24 months?
January was an inflection point, and no open-weights model has crossed that same threshold.
This is definitely recursive self-improvement territory, except that we're prohibited from participating.
It feels like the capability gap is wider than before.
The threshold has definitely been crossed.
DeepSeek feels pretty close to Opus at this point, and it's certainly useful enough for me to spend $20 on API tokens instead of paying for four Claude Max plans.
A statement like this, clearly, requires a reference.
Imagine being a data scientist or MLE training a small classifier model. How would you know that steering vectors or PEFT aren't being applied to your requests?
Are you saying they should relax the guardrails because, with 30 days of retention, they can find out if you produced something bad? If so, I suspect they chose their current path to prevent the production in the first place, since you can't un-produce something. The production itself is what would cause regulatory or PR problems.
First, those cases are never harmful to the world; second, degrading such a broad swath of ML work does even more damage.
My proposal would be either (1) don't degrade models at all, since with 30-day retention I'm sure they can do a reasonable job of banning DeepSeek or whoever, or (2) surface user-facing refusals instead of silently degrading ML work.
See: