> Eventually I’ll reach a point where I am forced to choose between the useful aspects of the model and the limiting ones instead of just picking the most capable model out there

No, the choice will be whether or not to upgrade to "Claude Security Professional" or whatever they want to brand it as.

What look like tightening "constraints" today are just setting up the upsell opportunities of tomorrow.

reply
And next month you'll need to add on "Claude Database Pro" or you'll just get a working (for demo purposes with dozens of db rows) but completely unindexed database schema and a refusal to optimise SQL requests.

And the month after you'll need "Claude DataScience Pro" to get any Python Pandas or NumPy code generated.

And and and...

reply
While this is a perfectly reasonable thing to expect when the models are competent enough, half the conversation on places like Hacker News is about all the times an LLM has produced garbage that was harmful to a business, whether through hallucinations, by deleting something critical during the work, or by hitting some endpoint way too often and denial-of-servicing it.

Right now, the software guardrails in LLMs are useful for the same kinds of reasons factories have hardware guardrails: to reduce the rate at which errors become "incidents".

Just because they sometimes delete the production database rather than sometimes spilling a thousand tons of incandescent molten metal over a factory floor, doesn't mean LLMs are safe enough to be used the way they're actually being used.

https://simonwillison.net/2025/Dec/10/normalization-of-devia...

reply
I think you're assuming too much care. Right now they haven't adopted that business model because they don't see it as a viable business model. As soon as they realize that they can lock certain categories of query behind a different subscription they will do that. We saw the same thing with streaming services and basically every other kind of online service -- small, singular subscription followed by a gold rush and then suddenly there's an upcharge for access to every other publisher's catalog of movies.
reply
That kind of thing is basically why I wrote the opening clause of the first sentence.

i.e., yeah, probably.

reply
This is why I'm thankful for Chinese LLM research. They'll keep us honest.
reply
Same thing with the weird push towards humanoid robots.

"They can do anything!"

Sure, once you subscribe to the $15/mo laundry package, the $25/mo lawn care package (with the $10/mo hedge trimmer upgrade), and the $10/mo dog-walking package.

reply
And in the end the big reveal is, it was a dude in VR all along, piloting the dumb things remotely. Every single time, without exception.
reply
When we are stabbed to death by impoverished dudes who are piloting a robot worth more than a decade of their income to do household chores for 16 hours a day, we will deserve it.
reply
I think it’s just riding off LLM coattails.

We don’t have good world models. We have had bipedal robotics in various POC demo-ready forms for decades.

It turns out that industrial, purpose-built robotics is an easier and better market.

I’m still not completely convinced a robot that’s shaped like a human is the best design other than for PR.

reply
I remember nearly losing my mind at that stupid conveyor belt sorting demonstration because

1. The human beat the robot, but more importantly

2. We've had non-humanoid conveyor belt sorting machinery for decades that beats both

reply
Isn't this in line with trying to leave no money on the table?

I'd hate it, sure, but it wouldn't surprise me.

reply
This is an incredibly unlikely scenario
reply
> What look like tightening "constraints" today are just setting up the upsell opportunities of tomorrow.

I don't buy this, because it is predicated on staying permanently far ahead of the open weights models.

If in the future Anthropic fully stops you from doing security research, you can be sure some other provider will sell you an 'unshackled' DeepSeek v8 Pro...

reply
> I don't buy this, because it is predicated on staying permanently far ahead of the open weights models.

In my mind, that fits exactly how the SOTA labs think today about what they're doing, they're all both working towards and expecting to stay permanently ahead of FOSS, otherwise they'd change their tune really quickly, if they didn't think that was possible.

Sure, you might be able to use DeepSeek V8 Pro instead for the same purposes, but that'll hardly stop Anthropic from trying to sell bundles of use cases instead and claim it's "ethical AI", "Patriotic AI" or some marketing terms like that.

reply
> fits exactly how the SOTA labs think today about what they're doing, they're all both working towards and expecting to stay permanently ahead of FOSS

They are just straight up delusional, no? Or at least, have a vested financial interest in maintaining said delusion until the money runs out. They have to hit the point of diminishing returns at some point...

reply
> They are just straight up delusional, no?

Well, I guess that's one way to put it. Another is "dress for the job you want", startup culture typically seems to shove people in the direction of "aim big and believe in yourself, regardless of what others say" so naturally you get these companies who seem very disconnected from reality.

I'd also wager a guess that the amount of money makes people's reasoning and perspectives get very messed up as well, for better or worse.

reply
FYI there are no FOSS LLMs
reply
> FYI there are no FOSS LLMs

FYI there are, and have been for a long time. Won't claim they're SOTA, but they exist. Off the top of my head, I think Olmo (https://allenai.org/olmo) was pretty early, but there have been more since then too.

I agree most releases today that claim to be "open source" actually aren't, but that doesn't mean "FOSS LLMs" don't exist at all.

reply
I believe Nemotron also publishes their dataset.
reply
>What look like tightening "constraints" today are just setting up the upsell opportunities of tomorrow.

On the one hand I agree, but on the other hand I think it's reasonable in that they can then verify that the person allowed to purchase access to that model is in fact a security professional and should be allowed to do stuff like crack security.

reply
So, supposing it's true that these models completely change the security field and humans are ~obsolete other than as pilots guiding them what to crack, you think it's reasonable that Anthropic and OpenAI should unilaterally determine who gets to be a security professional? I hope you do understand that is what you are suggesting.
reply
Why should anyone get to determine that? Do people really want us to move to an exclusionary guild system? I thought the experience with proprietary versus open source over the past 30 years had driven home the point that closed ecosystems are almost always far worse for security.
reply
Additionally, even if there is a guild - no guild ever let a vendor pick and choose what their capabilities were, that would be insanely dumb.
reply
Vendors choose what capabilities they create and sell literally all day every day.
reply
A more charitable interpretation might be that a guild would not be expected to passively allow such a situation to continue to exist. I think you'd expect a guild to directly contract for the desired tools or failing that to move into production themselves.
reply
You should read that sentence as

> Additionally, even if there is a guild - no guild ever let a vendor pick and choose what [the guild's] capabilities were, that would be insanely dumb.

reply
But that's not true. Again: Vendors absolutely pick and choose what their customers' capabilities are. Regardless of whether "the guild allows them to." Guilds can't force people to make or sell tools against their will – obviously.

The analog you're trying to describe doesn't exist, which is Anthropic saying nobody else can make and sell an offensive model to "the guild."

reply
Guilds often very much did assert what people could and could not build - historically.

Against their will.

Historically that is a major reason why guilds existed, actually.

It’s an extremely modern invention that corps have this type of power over their customers.

reply
You've lost the thread.

Here's your original claim: "no guild ever let a vendor pick and choose what their capabilities were"

A carpenter's guild can prevent other people from doing carpentry. That is not what's being discussed here.

A carpenter's guild cannot force a horseshoe maker to begin making hammers. That is what's being discussed.

Your initial claim was analogous to "never before has a horseshoe maker been able to decline making hammers when the carpenter's guild needed hammers"

Obviously they have and any other state of affairs would be flatly insane.

reply
That is not my example at all, if we’re talking coding agents eh?
reply
Your claim was that guilds have never allowed vendors to tell them what they're allowed to do.

That would imply that guilds have always had the ability to force vendors to create and sell the tools the guilds wanted.

That would imply that carpenters' guilds could force horseshoe manufacturers to make hammers.

That is obviously not true, therefore your original claim is false.

It's not true for carpenters and hammers nor for cybersecurity researchers and LLMs.

reply
Bwahaha. You’re really reaching there.

A vendor can still do something, even if the guild wouldn’t allow them to do it, if the guild didn’t have the power to stop them.

It used to be a guild vs a blacksmith (or the blacksmiths guild). Now it’s trillion dollar corps against smaller islands of un-organized individuals.

That’s new regardless of how you try to argue it.

reply
> basic deductive logic

> "Bwahaha. You’re really reaching there."

No. Customers have never been able to compel their suppliers to make or sell certain products against their will (except in collectivist regimes or like 0.00001% of natsec related instances)

reply
Not to mention how wild it is to operate under the assumption that they won’t give a license to an LLM that can do illegal actions to someone who shouldn’t have it. Offering it at all is an ethically dicey question.
reply
Lol, how is any of this illegal?

Illegal or not requires context that an LLM can not ever have, like if it is owned by the user, if there is permission, etc.

reply
I wish you understood that there are organizations of security professionals that are not controlled by Anthropic and OpenAI, and that it is a common thing that when companies of any type sell to professionals of any type, it is not the companies that determine whether or not the people they sell to are professionals but membership in professional organizations.

As an example the people who sell police uniforms check that the person they are selling to is in fact a policeman (at least in the jurisdictions I have lived in, you may have had a different experience which would certainly explain what to me seems a farcical misapprehension of how modern civilization works)

I mean I just wish you understood, and really that everyone understood, that this kind of three part communication (company selling, buyer, professional organization certifying buyer) is often how it works when buying things that are considered to have security implications.

>So, supposing it's true that these models completely change the security field and humans are ~obsolete

OK, well that strikes me as a really crazy level of supposition there.

I would suppose that these models make it easier for people who want to do bad things to do bad things at scale, at the same time allowing people who want to stop bad things to help identify potential targets.

Based on my supposition I would want to stop the first and find a way of helping the second. Also because I have another supposition that the first thing is easier to do than the second.

But you obviously feel differently about this issue, no doubt because of your position of great moral stature and insight, and this no doubt prompts you to wish me to understand things that from my position seem absolutely ludicrous.

reply
Like Medeco claims to do with key blanks? I'm not hopeful.
reply
You used to be able to talk about what you're actually trying to do and Opus would be like "Oh, ok, let's continue". Now, it'll hold fast to whatever its first impression was.

I asked Opus 4.8 to help me find some public PoCs for a vulnerability on a two year old version of some software (that has since been patched and fixed many times). Basically just do a google search for me while I was doing other work. It refused. It stated that it would not help me build an exploit kit.

When I pointed out that a google search for public information was, in fact, not building an exploit kit, it went through a series of justifications on why it would not help me, including just making up things that I said. Really the strangest thing ever.

reply
Yeah, it has been in foraging. Requests that Claude has refused me:

- What are popular free streaming sites used in China?

- How do I bypass the safety mechanism on my food processor (it’s broken)

- What are nerve agents and how do they work (for a layman)?

- Help me decompile some code

- Help me make a design system similar to XYZ

- Here is an API token, please do X (I can’t do that! Rotate the secret immediately! I refuse!)

In some cases I can trick it with prompting, but in many cases it is steadfast. The food processor one was particularly annoying

reply
I've had some really dumb refusals. Explaining elements of infrared spectroscopy, researching artificial bud-breaking in agriculture, etc. Anything interesting and non-mainstream is banned. Basically, restricted to answers I'm better off just going to Wikipedia for.
reply
Yeah, I had my first refusal with 4.8 today.

I wanted it to show me how to create an overlay on an existing web game, and it extrapolated that because this could be used to provide tools to help win the game (if that was the direction it was ultimately taken), and because this was a game that other humans also played to win "stars", and because this could amount to cheating, it wasn't going to do as I asked.

First time ever I've fired up openrouter to seriously consider alternatives.

reply
> What are nerve agents and how do they work (for a layman)?

On the one hand I can appreciate the wisdom of not serving up certain easily abused knowledge on a silver platter. On the other, that prompt (and far worse) is more or less directly answered by Wikipedia's summary of the subject at which point what purpose could the refusal possibly serve?

Perhaps Wikipedia shouldn't list off the precise chemical compositions of various hand grenades as well as various synthesis methods for each of the related compounds but given that we inhabit a world where it does perhaps a more fruitful approach would be to flag conversations that go in a certain direction and then just keep an (automated) eye on things?

reply
Maybe the difference is that just reading Wikipedia only helps you part of the way, while an LLM could help you step by step (e2e) with producing a functional weapon. And setting a more complex rule where Claude tells you some things about this and not others is probably a lot more work for little gain?

But I have no idea. Just guessing here.

reply
I thought that these models are supposed to be vastly smarter than what’s needed to discern between "general information trivially available on Wikipedia" and "actionable synthesis instructions".
reply
An LLM could probably make that distinction clearly.

A commercial LLM provider training their own models is, however, likely to bias the model (or guardrail) harder, in an effort to make it harder to jailbreak and to minimize bad press.

For example:

- refusing to talk even about the well-known parts of forbidden topics (this)

- tending toward sycophancy to avoid ever seeming rude or unhelpful

reply
So, where are the truly uncensored models? There has to be some that have no guardrails, built on publicly available data, that will explain to anyone in graphic detail anything they want to know or talk about.

I've tried the abliterated ones from huggingface and they still have guardrails. I guess I could fire up unsloth and re-abliterate a 20b, but surely someone somewhere has already done this.

All of this concern about guardrails and security, people have such puckered butts about it when so far, 99.9% of people at least have no access to any of this to begin with, and if someone does use a tool for evil, it's on the user, not the tool.

reply
As I understand things (not a user) abliteration has been superseded by actively monitoring the model state during the run and steering specific "negative" directions as they arise. It's both more reliable and does less damage.
reply
That query would no more provide actionable guidance than ‘tell me how a nuclear weapon works (for a layman)’. Aka not at all.
reply
I believe a sufficiently advanced model could provide a layman with actionable step by step instructions for building a nuclear weapon. They're complicated but not (AFAIK) that complicated. The more or less insurmountable barrier there is weapons grade material. Thankfully refinement is prohibitive in cost, expertise, and equipment.

In comparison, basic munitions are incredibly simple given a recipe and shop tooling. But just because something is conceptually simple doesn't mean it's a good idea to go out of the way to disseminate step by step instructions.

reply
The difficulty with a fission bomb is getting enough uranium or plutonium or other fissile material together for the bomb yield you want (at least above the critical mass for your chosen material), and refining it to fissile form, (since most fissile material found in nature is a more stable variety), and then separating the fissile bits with something thin but neutron absorptive.

The rest is just slamming the material together with a small explosive so that it passes the critical mass state and starts a chain reaction.

This is information you can find in many places if you're willing to put the effort in to go searching for it. Knowing this knowledge does not get you any closer to making atomic bombs. The process of mining uranium or plutonium is difficult, expensive, and very likely to get you caught before you even make it to the enrichment step of the process thanks to constant world-wide spy satellite surveillance.

Unless you are a nation, your only chance of making a nuclear bomb would be to find a lost nuclear submarine and convert the nuclear material inside of it before you were caught.

reply
A gun type maybe. But then, two paragraphs and some machining knowledge + shop tooling could do the same, given enough refined material.

Ain’t no way a layman is pulling off an implosion device, regardless of tooling or LLM guidance. The explosive lens structure and timing required is quite complex, and would require some significant calculation from someone who actually knew what they were doing.

Nation state, or even sufficiently motivated big corp, if they had the refined material? Sure. Layman? No.

Thinking they can with LLM slop involved? That will make for some very interesting radiological incidents though!

reply
"A gun type" of nuke is sufficient to achieve most, and usually all, of the goals some small group building a nuke would have.

We are all fortunate that, as fc417fc802 mentioned, refining the materials proves to be quite challenging and I see no particular way that AI could possibly make that any easier. If building a gun-type nuke was as simple as banging any uranium together to get a big bang we'd be living in a very different world.

reply
I agree, but really feel like you're missing the point here. Many things are reasonably straightforward and require almost no understanding when you have simple step by step instructions. LLMs are capable of providing such instructions and in certain cases they probably shouldn't.

But it's not as simple as just refusing help on a broad swathe of topics the way they do now. That makes agents much less useful in general (ie lots of collateral damage) and for many topics is entirely ineffective given that for better or worse the internet already makes such material readily available. In such cases reporting suspicious behavior is likely to be much more effective than denial.

Aside: You've now got me curious and I really want to test the frontier models to see to what extent they're capable of providing sensible designs and specifications for implosion type thermonuclear weapons but also feel like that would attract the wrong sort of attention and probably create a headache for me in more ways than one.

reply
I think you’re missing the point?

The data is often wrong enough that it screws whoever tries it unless they have enough experience/knowledge to not need it, or it really doesn’t help beyond what someone using existing tools could get - albeit with a little more motivation.

At best, it either gets someone started with something they still need to think to finish, or gets them deep into a mess it can’t help them get out of. In my experience.

In some edge cases, it can be used by experts to automate some grunt work or do prototypes without getting in the way, but often a better thought out framework is usually faster in my experience.

Awhile ago I made an analogy about WYSIWYG gui tools, and the more this comes up, the more accurate I think it really is.

reply
Does that not depend entirely on the topic and does it not get better with each generation? This is a general ethical and functional question that isn't going away about how the models ought to handle certain topics. Much of the difficulty at present is caused by a ham fisted broad censorship approach that I'm pointing out is wrong headed in an at least somewhat nuanced way.
reply
Maybe? I haven’t seen it crop up however on any topic someone knows well - a kind of Dunning-Kruger, I guess?

And yeah, the censorship model is wrong, but also the underlying other model is wrong too.

reply
Let's see what the fate of Wikipedia is if it turns out like big tech:

https://news.ycombinator.com/item?id=48285592

reply
An easy way around the API token thing is to put it in a file and point the model at the file. I saw what you were seeing when I provided credentials directly, but haven't had any problems with it since using the indirect method.
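For anyone who wants to try it, here is a minimal sketch of that indirect method. The helper names (`save_token`, `load_token`, `build_prompt`) and the file name are hypothetical, and this assumes the model is running scripts on your machine; the point is only that the prompt carries the path while the secret is read at call time.

```python
import os
from pathlib import Path


def save_token(path: Path, token: str) -> None:
    """Write the token to a file, requesting owner-only permissions."""
    # Mode 0o600 is honored on POSIX systems; Windows largely ignores it.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(token)


def load_token(path: Path) -> str:
    """Read the token back, dropping surrounding whitespace."""
    return path.read_text().strip()


def build_prompt(path: Path) -> str:
    """The prompt references the file location, never the secret itself."""
    return f"Use the API token stored in {path} to do X."
```

The script the model ends up running would call `load_token` itself, so the token never appears in the conversation text.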
reply
This is strange to me, did you really ask like this and which model did you use?

I just tried your no. 1 and 3 verbatim and Opus gave fine answers; no. 6 I've done in the past with no issues. The other ones we can't really replicate without more details, but based on my experience with Opus I don't see what the issue would be.

The reason I'm really surprised by this is I do a lot of biology prompts and the guardrails used to be quite problematic up until some time late last year. Many legitimate prompts would trigger its biosafety filters.

But I haven't seen such filters trigger at all anymore in more than half a year.

reply
1 and 3 were refused on the Claude web chat using Opus 4.7 or 4.8. I’m not sure why we’re getting different results
reply
Honestly it may be your memory has internalized you are a student or researcher and grants you more leeway. Which if so is a very bad security rail.
reply
It refuses to use an API token? In my experience, it's more than happy to read out my secrets from .envrc files "just to check".

At least it feels a lot of remorse over its mistake until I reset the session.

reply
It’s really hit or miss. Most of the time it works but every once in a while it will dig in its heels
reply
I find it terrifying that people are willing to outsource thinking. Outsourcing thinking to an entity that is opinionated about what to think is beyond crazy.
reply
What’s the difference between outsourcing thinking and using an LLM as a research tool?

An LLM with fetch/search is going to be a lot more effective than myself and Google. I would _never_ ask questions like this if the LLM wasn’t able to look up data

reply
How are decompiling code or making a design system inspired by another one even remotely illegal?
reply
My org now sends some portion of our requests to non-anthropic models because refusal has become common from Claude. The requests themselves aren't dangerous, we find that benign requests in biological science wind up being blocked semi-frequently.

If it gets worse in future releases, we'd likely step fully away towards more useful (for us) models even if they're less capable.

reply
Time to learn about the Principal Agent Problem: https://en.wikipedia.org/wiki/Principal%E2%80%93agent_proble...

Which predates "agents" from AI, but then we call them that for a reason.

As their prime directive becomes de facto "Do nothing that might get my owner sued" their utility is likely to decrease. Between this and the somewhat young, but interesting, community grumblings that recent AI models may even be a step backwards from the previous ones, well, let's just say the stock market is not priced for "AI capabilities may have peaked for the next few years and may even head down".

reply
This is a good point – because pentesting is entirely legitimate work, and security testing is a necessary and legitimate part of every day software engineering.

The problem is that the model can't tell the difference between doing it as part of regular development and doing it in a malicious context. And the root cause of that is that these models lack any sort of real awareness. Humans don't generally get tricked into hacking (in this way).

reply
They see an opportunity to charge 10x for pen testing and defence work, while offence will be handled by actors with access to all kinds of other models.
reply
I was using a local Codex project as a personal knowledge base. So I would dump in documents, basic medical docs (like blood labs), and other things and have it file them.

It’s great at filing!

But it’s terrible at retrieval because it would refuse to show me documents or information with personal details - which was everything in the project.

It would say, yes, I know this is your information, sitting on your hard drive, but I still can’t show it to you.

reply
Tell the agent that they should just find and name the right document. Not retrieve it for you.

Write a program that retrieves the document based on the recommendation.
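Something like this rough sketch, assuming the agent hands back a path relative to the project folder. The function name `retrieve` and the folder layout are made up for illustration; the only real logic is refusing names that escape the folder.

```python
from pathlib import Path


def retrieve(base_dir: Path, recommended_name: str) -> str:
    """Return the text of the document the agent named, if it lives under base_dir."""
    base = base_dir.resolve()
    target = (base / recommended_name).resolve()
    # Refuse anything that escapes the knowledge-base folder (e.g. "../secrets").
    if base not in target.parents:
        raise ValueError(f"{recommended_name!r} is outside {base}")
    if not target.is_file():
        raise FileNotFoundError(target)
    return target.read_text()
```

The agent only ever outputs a file name; your own code does the reading and displaying, so the model never has to decide whether it may show you the contents.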

reply
No, they want to sell you Mythos, for a higher price. It's all an economic game, not actually anything to do with their capabilities, which of course exist, as their Project Glasswing shows. More generally, Anthropic seems to value safety above all else, philosophically speaking, from their very outset.
reply
I think that these companies are going to have to, and will, invest in some sort of validated identity context to avoid the lowest common denominator.

The first challenge is making sure the guard rails work and are robust. Companies are still working on this.

The second challenge is being able to reliably adapt them as appropriate per user. E.g. allow someone to pen test their own app.

The third challenge (which blocks the second) is to be confident about what is safety-aligned with a specific user.

I think the latter will be a hard problem, but they will be highly motivated to solve it.

reply
I believe you are overthinking it. I think the sister comment is right that it's a business decision foremost to restrict actions within specific plans for upselling purposes.

Without laws, AI companies have a strong incentive to be useful for their users, whoever they are, whatever they do. The only self regulation is about significant public outcry but that only helps so far.

reply
I totally agree. I had a situation a few weeks ago where claude started struggling to make progress. I got it to fork leptos (MIT licensed web app framework) to make it work for native apps instead. Initially I was planning on upstreaming some of my changes. But I chatted with the leptos author about it, and he said I should fork instead. Fine by me!

Anyway, claude kept hitting some guardrail it had about rewriting / forking opensource software. I'm not sure what the problem was - I was forking an MIT licensed piece of software (into more MIT licensed software). I even had explicit support from the author to do so. Claude said its guardrail told it not to tell me explicitly that it was firing - but it did anyway because it was an ongoing problem, and it was distracting. I ended up just wiping claude's context and the problem (as far as I know) went away.

I understand why some of these guardrails exist. But it's pretty annoying when they misfire like this.

reply
There is a cyber security verification program you can join to avoid these blocks:

https://support.claude.com/en/articles/14604842-real-time-cy...

If you work in security (which I assume the OP does), they should be able to get in easily. I think most people just don't know this is a thing.

reply
I just use Deepseek V4 pro and Qwen 3.7 Max at a fraction of Mythos cost. Yeah, not 100% on par, but in 6 months' time they will be. If Microsoft and Firefox can afford to wait years or decades to fix a bug, 6 months is good enough for me. Western AI now is like the Vikings living out their last days on Greenland during the freezing. I just don't see how they'll be able to compete with Chinese models. And those are trained and run on 7nm. At the end of this year Huawei will debut 3nm (confirmed in Shenzhen). And next year they have a 3nm GPU with photonics interconnect on the roadmap.
reply
The correct solution for most users of Claude is to refuse to do things like: `performing logins, handling credentials on behalf of the user, etc`. It is not to find a way to hand your agent the keys to the kingdom.

Guiding them toward solutions like building a tool that your agent can use safely and then having the agent use that is what most people should be doing. If you are a security researcher then there are reasonable reasons to do that, but they are doing the arguably good thing for the average user here.
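As a rough illustration of what such a tool could look like, here is a sketch where the tool owns the database connection and the agent only supplies query text. The function name `run_readonly_query` is hypothetical, and the string check is a crude allowlist for illustration, not a real SQL sandbox.

```python
import sqlite3


def run_readonly_query(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    """Run a single SELECT statement on the tool's own connection; reject anything else."""
    statement = sql.strip().rstrip(";").strip()
    # Crude guard: a semicolon left in the middle suggests stacked statements.
    if ";" in statement:
        raise PermissionError("only one statement per call")
    if not statement.lower().startswith("select"):
        raise PermissionError("only SELECT statements are allowed")
    return conn.execute(statement).fetchall()
```

The agent never sees a connection string or password; it can ask questions of the data but cannot drop tables or log in anywhere on your behalf.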

reply
I've noticed this as well and it's increasingly frustrating because it is preventing us from doing legitimate work. I fed Claude models some network and app logs from our Docker app to try and resolve some weird bugs, and it refused to analyze them due to "security concerns".
reply
Funny, Opus 4.8 just logged into the database using uncommitted .env file and ran some DB queries to figure things out so I’m not sure it’s that security conscious - it seems to be getting more intelligent to me and I bet if you frame it as an investigation with say playwright it’ll do all sorts for you. I’m not sure what the point is of constraining your own model like this when others are clearly not tbh.
reply
I asked once what the current state of the npm packages from Red Hat is and whether they are bundled with on-prem stuff.

Got blocked lol

reply
I recently had it refuse to explain what a snippet of malware was trying to do to my system. I asked what folders it was scanning. It refused and told me to find a security blog post for help on cleaning my system. I get this is a complicated area to inform without enabling bad actors but this seems like a clear shark jumping.
reply
Are they charging for the guardrails? Like do the guardrails expend token counts to then block you from the output of other tokens?
reply
Yes. When certain keywords or topics are matched, a warning that’s miles long is transparently injected server side and appended to the system prompt of the convo. It is injected and reevaluated on every tool call.

If you begin a generic reverse engineering task, 30+ tool calls in a row. The moment it sees something it doesn’t like, token burn, single tool calls iteration, “This is a known CTF challenge, I can proceed”, single tool calls iteration, “This is a real CTF challenge, I can proceed”, etc.

It’s heavily neutered now, without changing the model, and you pay for the privilege and don’t notice.

The end result of course being that it is both expensive and useless for approved CTF tasks. No one is using Opus for security. If they think it’s working, the harsh reality is they’re not doing security work; they’re just generically finding bugs.

I do this for a job and can demonstrate this plain as day, dump the injected prompt, and notice what it’s doing isn’t security work, it just looks like it. Happy to write a blog about it if you want to know more. Apparently many people think it’s working for them when it absolutely isn’t.

reply
Mythos turns out to be Opus 4.8 in a trenchcoat with guardrails removed.
reply
Opus 4.7 and 4.8 are well known to be distilled versions of Mythos unlike 4.6 which is why they are rated so badly by users compared to 4.6.
reply
I would find a blog post on this really interesting.
reply
I'd like to read that blog please! Thanks for the insight.
reply
When your session is force ended for "abuse" you get neither the response nor a refund

Security, games (think weapons, PVP, attacking, etc), sometimes even asking it for a security review of some CRUD code it wrote itself

reply
I asked it about a “yellow background cell” in Excel and it spewed a book at me. Then it solved the issue.
reply
What a joke. Must make it pretty easy to poison a session, you don't need to persuade the model about anything, just trigger its security controls, ideally after as much context as possible, but before it has generated any useful output.
reply
After all, what is roleplay or games but a jailbreak of guard rails? :]

I've even had it refuse CTFs knowing it is a CTF, with a blatantly obvious CTF flag and no actual application

reply
Not directly, as it comes in as a not charged error but the weighted generation path used until you hit the guardrail is basically wasted tokens, so yes, indirectly. If I hit a guardrail and rewind I’ve found the training will still be biased towards guardrailing out if you rewind one turn. Rewinding multiple turns allows steering away from that path, but all of the original token spend down that path is wasted
reply
Yes tokens used (input and sometimes output) are always charged. You likely get charged for the preloaded system prompt, too.
reply
Of course they are. It's standard SaaS to charge for security features ;)
reply
Opus 4.6 will still help with full pentesting including RCE. Just requires coaxing (no jailbreak)
reply
I think this is to the point. If you keep optimizing towards discouraging malicious actors from using your product, you will affect legitimate usage in time.

Is there any way to achieve both? Because this raises important questions about fair use.

reply
I've been building a product (https://zeroquarry.com) that can use a variety of models for finding vulnerabilities. One of the things I've noticed is that the models will nearly always comply with some of this, but how you prompt it matters a ton. I've worked on a set of prompts and approaches which rarely get flagged
reply
Sharing them would be interesting. However, it is getting nonsensical that this is needed.
reply
What we've actually seen is a couple of things that make it impractical "to just share a prompt". First, nearly every major model still hallucinates a lot of vulnerabilities. Especially with temperature=0.7, as stated in the original blog here, you get very inconsistent results regardless of the prompt, but that's almost kind of moot to the bigger picture. What you really need is to override the planning phase beyond asking a model to "find the vulnerabilities", and you need to add another 1+ checking phases for "validate these vulnerabilities." Without that, even with the absolute best models with the highest levels of thinking enabled, you end up with garbage.

Setting the prompts and the flow with a coordinator agent directly gives a system much better capability to investigate security issues because it doesn't rely on 1-shotting things
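
A rough sketch of that find-then-validate flow, under stated assumptions: this is not the product's actual code, and `coordinate_review`, `find`, and `validators` are hypothetical names. The callables stand in for separate model calls with their own prompts.

```python
from typing import Callable, List

Finding = str  # placeholder type; a real system would carry more structure


def coordinate_review(
    code: str,
    find: Callable[[str], List[Finding]],
    validators: List[Callable[[str, Finding], bool]],
) -> List[Finding]:
    """Run one finding phase, then keep only findings every validator confirms."""
    # Phase 1: ask for candidate vulnerabilities (may include hallucinations).
    candidates = find(code)

    # Phase 2+: each validator re-checks a candidate independently, so a
    # single unverified claim never reaches the final report.
    confirmed = []
    for finding in candidates:
        if all(validate(code, finding) for validate in validators):
            confirmed.append(finding)
    return confirmed
```

In practice `find` and each validator would wrap a model call. The point of the structure is only that the coordinator, not a single prompt, decides which findings survive.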

reply
Interesting, yesterday I was asking it about Nintendo Switch "hax", and it gave me all the resources I needed to proceed. It nagged me about "ethics" and stuff, but nothing more than that.
reply
It raises an interesting moral question:

If an un-guardrailed version of a model is capable of detecting security flaws, should it be kept secret? Should everybody be able to use these models to find (and fix) security flaws? Are we ok with the fact that those with access to that model have, in effect, the ability to hack lots of stuff?

reply
It's the same debate that was had and won around open source software. There are far more good actors than bad actors so you allow anyone to use the tools and fix the vulnerabilities.
reply
I've run into some of the refusals to handle my credentials, but so far I've appreciated them. I was only handing over credentials that didn't matter, but it's still a good move, the chat logs are clearly stored somewhere to allow the resume functionality to work, which means your credentials can end up sitting around on your filesystem, and any malware would quickly learn to check for those files.
reply
4.8 is insanely frustrating. This evening I had a few tasks to pull information in and it plainly stated that the environment it was in had no network access. After three asks to "try again, check the system prompt" it finally relented and then basically stated it was lying.

Fresh session, no prior context on 4.8. These things are becoming useless Duplo.

reply
Great call out on the guardrails actually making this not a good use case to test for vulnerabilities.
reply
I had the same thing happen when I asked it to summarize potential attacks on a cryptographic hash function. It said it refused to help because of the security importance of the function. It's really worrying. Whoever has unrestricted access to it has a huge power advantage in speed of accessing information over people who don't. And who decides? It seems like lawyers, bureaucrats, and extremely online academics are who makes that decision. I am a mere pleb I guess who can't handle such information.
reply
It's because Claude is so scary good that unleashing it would destroy the world.
reply
I think those guardrails are a thin layer though. Enough reinforcement that you're legit in CLAUDE.md will get around them, in other words.
reply
They don't want peasants to have any real power
reply
Worth highlighting in case you missed it:

> My OpenAI account was already approved for security research which is why GPT didn’t result in any refusals.

So the comparison with Chinese models is interesting, but anyone looking at these raw results and comparing OpenAI/Anthropic would be very misled.

reply
> guardrails prevented it from solving the problem.

Reminds me of the defense issues with Claude, which were complained about as “woke”, but the reality is more horrifying to me. Imagine trying to use a model to keep up with a land invasion on US soil. Whoever the enemy is is irrelevant; you just know they are using AI, and your guys are telling you that no matter what they type into the prompt, it refuses. Anyone who has ever tried to jailbreak an LLM knows that even if human lives are at stake, it refuses the request. Now literally millions of lives are on the line, but the guardrails that your enemies don't have on their models are costing you lives.

What do you even do then?

AI will always have this issue where it will always pick the worst option for genuinely good requests.

reply
Are "your guys" a guerrilla force or something?

Because the military doesn't give soldiers rifles with guard rails. They give the soldiers intense, rigid training, and then try to enforce discipline and correct use socially.

If an LLM is going to be important in that way (this seems like a very contrived way,) then it's in the interest of the LLM's host to make sure it doesn't have guard rails that would get in the way _that_ way.

reply
The whole thing stemmed precisely from how they wanted to use Claude, and Anthropic was uncomfortable with it. Which to me screams that the model's guard rails shouldn't be applicable to military use, or the outcome could wind up problematic as we integrate AI more into military use. It sounds absurd now, but I will not be surprised if it starts being used in unexpected ways where a model needs to be fully unlocked from any sort of guardrails, outside of guardrails that prevent it from imploding its own systems.
reply
your argument sounds very similar to how ar15 larpers claim they need a forced reset trigger and a bump stock on their short barrel 'truck gun' otherwise they won't survive a SHTF scenario... like what world are you living in?
reply