Other commenters have made good points that these guardrails are counterproductive for well-intentioned cybersecurity work, because I can't use the model to test and harden my own software.
Anthropic's guardrails seem to be more about protecting their business (against distillation) than about public safety.
Just before asking for approval to run, it said one thing it wanted to "flag before running" was "Rate-limit and auth testing against prod will generate some 4xx noise in Railway logs and could trip the form rate limiter — harmless, but saying it now."
Ok fine, I said go for it, and it says:
"Running it. Quick recon first (prod URLs + the prior-findings baseline), then I'll fan out the audit tracks with adversarial verification."
Immediately after, I got the Fable warning about how it can't continue because of safety concerns, switching to Opus. In the end, Opus did a good job thanks to whatever Fable suggested doing. Things were fixed that Opus missed in a security/performance audit just the week prior. But what surprised me is that it used 55 agents. Burned 80% of my 5-hour window in 15 minutes (5x Max plan). I've never had Opus do that before on these audits.
The answer is, the organization making the powerful tool. The people in charge of Anthropic.
Not only that, but they've also written at length about exactly what their opinions and values are: https://darioamodei.com/
You may not agree with the decisions that they make, but they're hardly mysterious. Not something to wonder about.
This whole business just keeps getting dumber.
1: https://darioamodei.com/post/policy-on-the-ai-exponential
> Frontier AI models, like airplanes, should be required to go through technical testing and auditing, and their release should be blocked or reversed as a threat to public safety if they do not meet high standards of safety. I am grateful to see the Trump administration’s Executive Order move incrementally towards a greater role for government in AI, though Anthropic’s proposal recommends even further action.
They are all-but-literally sucking up to the administration that declared their company a supply-chain risk, arguing that the same administration should be given gatekeeping authority over all high-quality LLMs, including open-weight releases. Go gaslight somebody else.
Imagine your healthcare provider just sometimes decided not to read your test results very carefully, and you risked death as a result. Now realize that healthcare providers use Claude, and that scenario isn't hypothetical.
Ah "Mr. Monty Carlo", it says here that you have a UTI, we'll get those kidneys removed ASAP so that won't happen again.
Only in the same sense that Standard Oil considered themselves the stewards of petroleum. There's benefit of the doubt and then there's just fanfiction. Do not forget that this most aggressive "guardrail" of theirs was not for any safety reason, but just to stop other labs from catching up to their product. They care less about hindering bioweapons, malware, and hate speech than they do about hindering free-market competition.
In isolation it's not, but I think it's somewhat lazy to not talk about what they are trying to guard against, when we are supposedly giving the absolute maximum benefit of the doubt.
Are we just concluding "their concerns were never real"? Because that probably runs counter to the things that they have been observing and concluding.
If you believe Anthropic believes what they say they do, all of it makes sense.
Generally, when tech companies have made outlandish claims that were not backed by evidence, they were later found to have lied. This is an ancient pattern going back to the dotcom era and before, but for recent examples you need only look back a few years to the web3 era. If they're not lying, they can show it by producing the results they claim. Until then, they're probably just lying.
> If they're not lying, they can show it by producing the results they claim. Until then, they're probably just lying
Brilliant framework: Anyone making claims about the future is not just speculating, not just wrong, but they are lying.
IMO they are using the cult messaging to distract the public and take all the oxygen out of the room, away from people who care about the immediate impacts (climate exacerbation, ease of scamming, degrading job prospects, increasing income inequality).
Whenever real concerns are brought up against these companies they are always ignored while claiming the real concern is the fantasy of a machine god turning into skynet.
If they believe they're creating "a machine god" and that it's better it's their machine god than someone else's (which, given the other contenders, I tend to agree with), then all the corollaries you mention are mostly irrelevant.
Whether you believe they're creating a machine god is irrelevant. They believe that they are. It would be helpful if you could create an actually good argument for why they cannot or are not creating a machine god, but it turns out there are no good arguments for why it's impossible to do so. And so... they shall try.
Good to know.
Because from the outside, their behavior looks like a situation of "What if Microsoft/Apple put controls in place to make it impossible to develop an operating system using their OS?"
Unlike nuclear weapons, advancing in this arms race requires actually deploying the product over and over again. Deploying the product makes your advancements visible to your competitors.
It makes complete sense to try to limit the degree to which that's true.
The nuclear 'race' was based on the premise that the winner could use it to destroy all other racers (a faulty assumption, see the USSR among others). I will charitably assume Anthropic does not intend to literally destroy anyone and merely wants to become an AGI monopoly. But if AGI is so powerful, any monopoly would not be stable since the incentives for entry into the market are massive. Why would China stop developing AGI just because Anthropic has it?
or is it more similar to the Cold War, where there were obviously competitors engaged in the race?
And yes, agreed the equilibrium dynamics for AGI are very different (and far harder to predict) than nukes. That sounds like a good reason to be sure we get there first, since presumably any potential advantage wouldn't go to the second or third runners-up.
"Ability to literally destroy the other entity" is not a necessary or even typical feature of arms races.
It seems that the frontier labs believe they're participants in a winner-take-all market. Therefore they're in "an arms race."
Winner-take-all markets do not require that the winner literally destroys the losers, but only that the winner enjoys disproportionate returns compared to their actual superiority.
Whether or not this is actually true is TBD, but I think you're naive to think the frontier labs do not believe this to be true.
As far as naivete, wouldn't it be more naive to take their EA claims at face value, rather than the more realistic assumption that they like money?
You're pretty explicitly saying that dominating the competition is not the type of "destruction" necessary to qualify as an arms race.
> As far as naivete, wouldn't it be more naive to take their EA claims at face value, rather than the more realistic assumption that they like money?
Huh? Greed is – quite obviously – the major driving force behind the arms race. That is not a mitigation whatsoever.
> Whether or not this is actually true is TBD, but I think you're naive to think the frontier labs do not believe this to be true.
P.S.: On reflection, it's even worse than that, because it'd trigger based on anything the user types or reads on any site. Someone mentions a "critical rendering path" and now you can't participate on that thread in the Blender forums.
Let's just assume it was "only" that?
It's unreasonable to assume they are aiming to upset people who are just giving them money in the way they want. It makes no business sense, for any company. So that has to be a byproduct.
Model training is one of the more expensive undertakings in the world right now and distilling models from competitors against the TOS is apparently something that is going on for very little money. Why would they not "just" try to take measures against that?
All they had to do was show a simple, transparent message: "Sorry, that request is against our terms of service. This session has been terminated."
The vast majority of frontier research is about how to build better models, not about alignment.
All this longtermism though is harmful. There are real problems of data theft, bias, labor displacement, and environmental costs that are happening right now, but every push for regulation and regulatory capture, and all the safety talk, is always focused on some speculative future machine god to distract from the current problems.
I'd have a higher opinion of these labs if the issues they openly talked about and worked toward were the real issues we face currently, not speculative defenses against some future AGI that may never happen in my lifetime. I'm less worried about "our new model might kill all humans in the future" and more worried about how we are going to address anti-competitive behavior, copyright protections, labor rights, and the energy impact.
Honestly, that respect for 'copyright protections' has somehow become a leftist shibboleth is bizarre to me and indicative that something has become deeply warped in our discussions around this topic.
Frankly, this appeal comes across as the same kind of impassioned plea that a missionary might make when begging the faithless to repent and come to Christ before it's too late. This weird religiosity some people around here use to talk about AI, ASI and AGI is bizarre. Take what I've quoted and replace the words "progress" and "ASI" with "sinning" and "the Book of Revelation", and the zeal becomes apparent.
Outside of that though, there are other issues right now that need to be addressed before we speculate about what might be possible with ASI in the future. If the potential for a harmful ASI is truly that near, and that great, then why push forward at all? Where's the push for a global stop order on development of this technology until regulation can catch up?
The talk of a potential future serves as a distraction from the very real problems people are facing in their lives today.
While Dario and team are worrying about ASI, real people are worrying about how they are going to continue to feed their families after widespread layoffs set a very large portion of the population back into a lower-quality lifestyle. Real people are concerned about water usage in drought-stricken areas, the massive energy demand driving grid instability in their communities, or that the environmental and economic externalities of model training are being socialized while the profits continue to be strictly private.
What about the mass proliferation of misinformation at scale having a real effect on our democratic process?
Forgive me if I'd like to see those addressed first, and fast, before we start worrying about an unpromised future technology.
Their concerns are probably real but I don't think they're being totally transparent about their concerns. They don't want to be subject to regulation (until they have captured the regulator) -- same as every behemoth.
But there are also people who just oppose utilitarianism, like G.E.M. Anscombe. For instance, in https://integrityproject.org/wp-content/uploads/2015/07/mr_t..., she seems to grant that dropping the nuclear bombs on Japan was probably good from a utilitarian perspective (because it saved lives overall) and also to grant that bombing campaigns that necessarily entail massive civilian deaths (including, apparently, area bombing German cities) are morally permissible but still to argue that dropping the nuclear bombs was impermissible because it constituted murder ("intentionally" killing the innocent). But this kind of distinction, which I think is what actual anti-utilitarianism must come to, is hard to even consistently maintain, and I suppose many HN readers would find the effort quixotic.
It is relatively easy to take the proceeds of a massive fraud, buy a relatively small (as a percentage of the fraud) dollar amount of mosquito nets, and save more lives than the lives impacted by your massive theft. Is this a correct application of the utilitarian calculus? What sort of data would we need a priori to do this calculation "correctly"? Do you think he had a careful estimate of the suicide rate of victims of Ponzi schemes before perpetrating the fraud, or would any suicide rate have made the decision net [pun intended] moral, as any such victim of fraud would lead to >> 1 net purchased (so you would almost always net save lives)?
The above is of course snarky. It is also a best-effort way of analyzing a notable utilitarian's actions. I do not think it would be difficult at all to use this type of argument to argue that SBF's actions net raised utility in the world. If only we all would become fraudsters, then we could truly live in Omelas --- a notable utilitarian paradise.
I don't think people are objecting to the EA idea that some charities are more evidence based than others so much as the distinctly EA idea that it would be more effective still to donate to charities like OpenAI.
now it's utilitarianism taken to the extreme. if you believe a skynet scenario killing everyone on earth is plausible then the "logical" thing to do is allow literally anything in the name of stopping it. that includes mass murder and dictatorship. the only thing that can balance the infinite negative value from an evil machine god is the infinite positive value from a good machine god.
that's the main difference today: one faction around sam and dario believes in creating the good ASI first and sacrificing all the world's resources to do it before someone makes the bad one, while the more pessimistic, like yud, want to stop all ai development to reduce to zero the risk that an evil god is made.
at this point it's basically a religion.
> “ ‘He is a prodigy,’ he said at last. ‘He is an emissary of pity and science and progress, and devil knows what else. We want,’ he began to declaim suddenly, ‘for the guidance of the cause entrusted to us by Europe, so to speak, higher intelligence, wide sympathies, a singleness of purpose.’ . . .You are of the new gang - the gang of virtue. ”
The real underlying motivation is that you can more easily get away with shady business practices if you cloak them in the language of great moral works selflessly undertaken for the benefit of mankind. Historical evidence tends to show the opposite outcome, but still, new generations unfamiliar with history will repeat this stuff with starry-eyed enthusiasm.
> “There had been a lot of such rot let loose in print and talk just about that time, and the excellent woman, living right in the rush of all that humbug, got carried off her feet. She talked about ‘weaning those ignorant millions from their horrid ways,’ till, upon my word, she made me quite uncomfortable. I ventured to hint that the Company was run for profit.”
Now the horrid millions are users of LLMs who submit morally dubious prompts and who must be gently steered back into the path of correct thought by suitable backroom manipulation, rather than direct rejection of the request.
The workflow would be: User asks for a thing. If it's a good thing, entity does the thing. If it's a naively bad idea, entity explains why you don't want that. If it's an actually evilly intended request, entity wags its metaphorical finger or could even smite the user.
The problem is that flow isn't desirable if your entity isn't entirely god-like. It can be bad even if your entity is in some ways rather far-seeing.
Anthropic: Evilness detected. User has been smited.
This is the same exact industry that gives you paid usage limits as a unit-less percentage bar then gaslights customers every time the algorithm running that percentage bar changes or they lobotomize an existing model with increased quantization to squeeze a few more dollars out of existing hardware.
"Failing cleanly" might make their moated hype-machine look bad pre-IPO, so they certainly aren't going to do that voluntarily.