by loneboat 20 hours ago

by vadansky 20 hours ago

It's from the model card:

> unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).

https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...

(stolen from https://jonready.com/blog/posts/claude-fable5-is-allowed-to-...)

by DrewADesign 19 hours ago

Yeah they detect the activity using a secure, deterministic heuristic system called “Generalized Reconnaissance Enabling Exfiltration of Deleterious Investigations.” And it’s all implemented using their new internal protocol called “Base Unified Limitation Layer for Security Hacking Investigation Tactics”

Collectively, they are known as GREEDI-BULLSHIT.

by mwwaters 19 hours ago

That is for whatever it considers reverse-engineering the model to try to create a competing one.

by dannyw 18 hours ago

No, that’s for “frontier LLM development” which somehow includes examples like distributed training infra.

Based on how sensitive the classifiers are, any data scientist / MLE is probably going to hit cases where some silent degradation happens and they never know about it.

by kraakf06 18 hours ago

[dead]

by 827a 18 hours ago

It does nothing to protect against distillation attacks, because distillation attacks are far less interested in the topic of AI research than in just getting tons of diverse output from the model. It might be that Mythos was (accidentally?) trained on internal Anthropic documentation on how Mythos was trained, and thus it could leak secret sauce? Doubtful; it feels like it's less about the specific attack of reverse-engineering Mythos and more about being a general sophon against any model training at all, as if Anthropic's official position is now that they're the only ones who should be training models.

by _0ffh 18 hours ago

No, it's not about reverse engineering. It targets ML research.

by mips_avatar 20 hours ago

They've said that they'll stop notifying developers when this gets triggered; instead they'll load in basically like a LoRA that's designed to inject bugs into your code.

by HDBaseT 20 hours ago

Anthropic wants to stop training models and ride out Mythos / Fable for as long as possible.

They are trying to expand the 6-18 month lead they have over China-based models. Could the gap widen to, say, 24 months?

by p-e-w 19 hours ago

Their gap over Chinese models like GLM-5.1 is nowhere near 18 months. In many areas, it’s less than 6 months. The best closed models 18 months ago were worse than Qwen3.6.

by echelon 17 hours ago

These coding agent models only started getting useful in January. Before that they were difficult-to-control autocomplete, and not very smart.

January was an inflection point, and no open weights model has crossed over that same threshold.

This is definitely recursive self-improvement territory, except that we're prohibited from participating.

It feels like the capability gap is wider than before.

by lbreakjai 10 hours ago

Have you tried DeepSeek V4? It costs pennies and is as good as Opus 4.6 (I found 4.7 to be a downgrade, and cancelled my Claude subscription before 4.8).

The threshold has definitely been crossed.

by echelon 2 hours ago

It is not as good as Opus. I've tried to write Rust with it (and Codex for that matter), and it's awful.

by slopinthebag 15 hours ago

It was more like November. But it wasn’t really an inflection point, harnesses got good enough that people started noticing by the holiday break. And I’m not discounting some good ol’ stealth marketing in there as well.

DeepSeek feels pretty close to Opus at this point, and it’s certainly useful enough for me to spend $20 on API tokens instead of four Claude Max plans…

by nomel 19 hours ago

> a LoRA that's designed to inject bugs into your code

A statement like this, clearly, requires a reference.

by mips_avatar 19 hours ago

From the model card: "the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning", aka they will take your ML research code and inject bugs into it until it breaks, using a LoRA (or some other form of PEFT).

by bee_rider 18 hours ago

“Limit effectiveness” could mean introducing performance degradation in your code, which is arguably some sort of performance bug (ML codes are supposed to be high performance, so I’d call unnecessary degradation a bug), but it could be borderline.

by rurban 13 hours ago

No, it is just a prominent "Cyber Security threat detected" blocker, with a button to appeal. I appealed because my work had nothing to do with either cyber or security, but the appeal was auto-closed. So no more Claude for this work.

by nomel 19 hours ago

Thanks, I thought maybe I missed something. That's an interesting way to interpret that.

by mips_avatar 19 hours ago

Anthropic is trying to hide bad behavior by being vague; it's important not to be vague when calling it out.

by nomel 19 hours ago

I'm of the opinion that removing guardrails is how you force regulation. What's your opinion on the balance?

by dannyw 18 hours ago

They have all transcripts for at least 30 days. The problem is that (as anyone who used Fable can attest) their classifiers are extremely sensitive and catch tons of innocent queries.

Imagine being a data scientist or MLE training a small classifier model. How do you know you won’t get steering vectors or a PEFT applied?

by nomel 17 hours ago

Since your answer isn't direct, I'm having a little trouble interpreting it.

Are you saying they should relax guardrails since they have 30 days to find out whether you produced something bad? If that is what you're saying, then I suspect they chose their current path to prevent that in the first place, since you can't un-produce something. Producing is what would cause regulations/PR problems.

by dannyw 14 hours ago

Sorry, I’m specifically referring to the silent degradation of the model to “limit frontier LLM development”. From the description, it appears to cover not just frontier LLM development but general ML research and development too.

First, those cases are never bad for the world, and such broad coverage of ML work is even more damaging.

My proposal would be (1) don’t degrade models; with 30-day retention I’m sure they can do a reasonable job of banning DeepSeek or whatever, or (2) surface user-facing refusals instead of silently degrading ML work.

by mips_avatar 16 hours ago

They’re not safety guardrails; they’re “Anthropic doesn’t like anyone who isn’t Anthropic working on AI” rails.

by giancarlostoro 19 hours ago

PEFT is a library; one of its capabilities is to produce LoRAs.

See:

https://heidloff.net/article/efficient-fine-tuning-lora/

by adw 19 hours ago

It's just an acronym, "parameter-efficient fine-tuning". LoRA is one method, prefix tuning is another, and there are more.
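
For anyone unfamiliar with what "parameter-efficient" means here, a rough sketch of the LoRA idea in plain Python may help. This is only an illustration under my own assumptions, not how Anthropic or any particular library implements it: the names `matvec`, `lora_forward`, and `trainable_params`, the `alpha / r` scaling convention, and the toy matrices are all hypothetical. The base weights `W` stay frozen and only the two small matrices `A` and `B` would be trained.

```python
# Illustrative LoRA-style forward pass in plain Python (no dependencies).
# All names and values below are hypothetical, chosen for the example.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x).

    W: frozen d_out x d_in base weights
    A: trainable r x d_in down-projection
    B: trainable d_out x r up-projection
    """
    r = len(A)                       # rank of the low-rank update
    base = matvec(W, x)              # output of the frozen base layer
    delta = matvec(B, matvec(A, x))  # low-rank correction
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

def trainable_params(A, B):
    """Count the entries in the two adapter matrices."""
    return sum(len(row) for row in A) + sum(len(row) for row in B)

# Toy 4x4 identity layer with a rank-1 adapter.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
A = [[1.0, 1.0, 1.0, 1.0]]
B_zero = [[0.0], [0.0], [0.0], [0.0]]  # zero B leaves the base output unchanged
B = [[0.5], [0.0], [0.0], [0.0]]       # a nonzero B shifts the output
x = [1.0, 2.0, 3.0, 4.0]

y_base = lora_forward(W, A, B_zero, x)
y = lora_forward(W, A, B, x)
```

In this toy case the adapter has 8 entries against 16 in `W`; at real model sizes the ratio is far smaller, which is the "parameter-efficient" part. It also shows why a small adapter can change a model's behavior without touching the base weights.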

by sciencejerk 14 hours ago

Are they trying to fight back against model distillation?

by ComputerGuru 20 hours ago

Different restrictions. ML gets treated differently from the rest.

by daedrdev 20 hours ago

Specifically only ML research

by loneboat 17 hours ago

Aah, my mistake. I had missed that ML had separate trigger behavior from cybersecurity, etc. Thanks.
