Teaching Claude Why

(www.anthropic.com)

173 points

by pretext 17 hours ago

by justonepost2 11 hours ago

If you successfully build a highly capable “aligned” model (according to some class of definitions that Anthropic would use for the words “capable” and “aligned”) and it brings about a global dark age of poverty and inequality by completely eliminating the value of labor vs capital, can you still call it aligned?

If the answer is “yes”, our definition of alignment kind of sucks.

by ben_w 6 hours ago

> If the answer is “yes”, our definition of alignment kind of sucks.

Sure, but the original sense of this is rather more fundamental than "does this timeline suck?"

Right now, it is still an open question: "do we know how to reliably scale up AI to be generally more competent than we are at everything without literally killing everyone due to (1) some small bug when we created the loss function* it was trained on (outer alignment), or (2) that loss function, despite being correct in itself, being approximated badly by the AI due to the training process (inner alignment)?"

* https://en.wikipedia.org/wiki/Loss_function
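To make the "small bug in the loss function" case concrete, here is a minimal toy sketch. Everything in it (the cleaning-robot behaviours, the numbers, the function names) is made up for illustration; it only shows how a loss that is slightly different from what the designer intended can be minimised by the wrong behaviour. It does not model the inner alignment case.

```python
# Hypothetical behaviours for a toy cleaning robot (all values invented).
behaviours = {
    "clean_properly": {"mess_left": 0, "mess_hidden": 0, "effort": 5},
    "hide_the_mess": {"mess_left": 0, "mess_hidden": 10, "effort": 1},
    "do_nothing": {"mess_left": 10, "mess_hidden": 0, "effort": 0},
}


def intended_loss(b):
    # What the designer actually wants: no mess anywhere, visible or hidden.
    return b["mess_left"] + b["mess_hidden"]


def coded_loss(b):
    # What was actually written down: only visible mess counts,
    # plus a small penalty on effort. Hidden mess is forgotten.
    return b["mess_left"] + 0.1 * b["effort"]


# Pick the behaviour that minimises each loss.
best_intended = min(behaviours, key=lambda k: intended_loss(behaviours[k]))
best_coded = min(behaviours, key=lambda k: coded_loss(behaviours[k]))

print("intended loss prefers:", best_intended)
print("coded loss prefers:", best_coded)
```

The two losses agree on "do_nothing" being bad, which is why this kind of bug can go unnoticed until an optimiser is strong enough to find the behaviour that exploits the gap.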

by chriskanan 10 hours ago

Jobs are an invention of humanity. About 50% of people dislike their job. People spend much of their lives working. Poverty and inequality are a choice made by society if society chooses poorly.

by llbbdd 9 hours ago

They're only an invention if you consider "seeking sustenance to live" not explicitly a job if there's no monthly direct deposit involved.

by ben_w 6 hours ago

Indeed.

On the plus side, if there really is no value to labour, then farm work must have been fully automated along with all the other roles.

On the down side, rich elites have historically had a very hard time truly empathising with normal people and understanding their needs even when they care to attempt it, so it is very possible that a lot of people will starve in such a scenario despite the potential abundance of food.

by skeledrew 6 hours ago

It's either: 1) the rich voluntarily share the means of production so everyone becomes equal, 2) the poor stage successful revolutions so they gain access to the means of production and everyone becomes equal, 3) the poor starve or are otherwise eliminated, and the survivors will be equal.

All roads lead to equality when the value of labour becomes 0 due to 100% automation.

by ben_w 5 hours ago

There's plenty of outcomes besides those three.

Over history, lots of underclasses have been stuck that way for multiple generations, even without the assistance of a robot workforce that can replace them economically.

Some future rich class so empowered would be quite capable of treating the poor like most today treat pets. Fed and housed, but mostly neutered and the rest going through multiple generations of selective inbreeding for traits the owners deem interesting.

by skeledrew 4 hours ago

Non-human pets don't have the capacity to rebel though; make humans into pets and there will again be the constant danger of rebellions as with slavery in the past. Without the economic incentive to offset.

by ben_w 4 hours ago

I disagree on both counts.

On the first, non-human pets rebelling is seen every time an abused animal bites their owner.

On the second, the hypothetical required by the scenario is that AI makes all human labour redundant: that includes all security forces, but it also means the AI moving around the security bots and observing through sensors is at least as competent as every human political campaign strategist, every human propagandist, every human general, every human negotiator, and every human surveillance worker.

This is because if some AI isn't all those things and more, humans can still get employed to work those jobs.

by theopsimist 5 hours ago

If there is truly 100% automation (including infantry/police), the most likely scenario is not any of the above; most people will be kept on some kind of minimum sustenance, enough to keep them from rebelling (“UBI”), and those who disagree will either be coopted into the elite or eliminated.

by skeledrew 4 hours ago

There's no reason to keep anyone on minimal sustenance though. They're absolutely useless alive from an economics perspective, and so would probably be better served ground up into fertilizer or some other actually useful form.

by ben_w 4 hours ago

> There's no reason to keep anyone on minimal sustenance though.

No reason, except their (the rich or the AI) own personal desire to do so.

https://en.wikipedia.org/wiki/Folly

> They're absolutely useless alive from an economics perspective, and so would probably be better served ground up into fertilizer or some other actually useful form.

Indeed. "The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else."

But while some may care about disassembling this world and all non-rich-human life on it to make a Dyson swarm of data centres, there's also the possibility each will compete for how many billions of sycophants they can get stoking their respective egos.

by jinwoo68 9 hours ago

Many (most?) people make a living from their job whether they like it or not. Having a job that they dislike is far better than losing one because of AI whatever that means.

by gbanfalvi 10 hours ago

Not sure it’s much of a choice; it’s more of a decision the greedy half make and impose (often violently) on the other half.

by justonepost2 9 hours ago

Sounds great! Quit your job then :)

by catlifeonmars 8 hours ago

I wish I lived in a vacuum. Idk about you but I did not make said choice.

by matthest 8 hours ago

Every biological being works to survive. Being good at survival is what builds self esteem.

The "problem" with many modern jobs is that they're divorced from the fundamental goal, which is one of: 1) Kill/acquire food, 2) Build shelter, or 3) Kill enemies/competitors/predators

The benefit of modern jobs is that they are much more peaceful ways for society to operate, freeing up time for humans to pursue art and other forms of expression.

by daymanstep 2 hours ago

You mean surrogate activities

by taneq 9 hours ago

The only thing invented about jobs is that through cooperation, the activity undertaken can seem completely unrelated to obtaining food, shelter etc. All organisms spend a majority of their energy on survival and reproduction.

by achierius 10 hours ago

And when have we not? When in history has mankind ever treated the idle poor well? What makes this age different, that we who can no longer work would be taken care of?

by robbrown451 7 hours ago

When in history has being idle not been a problem?

If AI and robots are able to do all the jobs, being idle isn't the negative it has always been.

All through history, you needed lots of non-idle people to do all the work that needed to be done. This is a new situation we are coming upon.

by xantronix 7 hours ago

If they are doing all the jobs, who is going to receive economic opportunities? Will we no longer be able to participate in the economy?

by skeledrew 5 hours ago

In what way do you want to participate when there's no economic value in any of it? Just do whatever you want for yourself; you're free.

by gmerc 7 hours ago

When in history of mankind have we ever… is an appeal to the inability of humans to evolve.

by fatata123 7 hours ago

[dead]

by eecc 5 hours ago

So are mortgages, and I’m starting to wonder how I will pay mine.

Please note I’ve never had this problem before, until recently.

by resident423 7 hours ago

There isn't even a solution for how to control highly capable systems at all; everyone wants to decide what to do with the AI before they've even solved the problem of controlling it.

It's like how everybody imagines their lives will be great once they're a millionaire, but they have no plan for how to get there. It's too easy to get lost dreaming of solutions instead of actually solving the important problems.

by justonepost2 7 hours ago

What’s an “important problem”? p(doom)? Anything else?

by ben_w 6 hours ago

FWIW, my P(doom) is quite low (~0.1) because I think we're going to get enough non-doomy-but-still-bad incidents caused by AI which lack the competence to take over, and the response to those will be enough to stop actual doom scenarios.

People like Simon Willison are noting the risk of a Challenger-like disaster, talking about normalisation of deviance as we keep using LLMs which we know to be risky in increasingly critical systems. I think an AI analogy to Challenger would not be enough to halt the use of AI in the way I mean, but an AI analogy to Chernobyl probably would.

by resident423 6 hours ago

P(doom) would be the most important for me; everything else depends on us being able to control the AI.

But beyond that there's still problems like concentration of power and surveillance, permanent loss of jobs, cyber and bio security. I'm not convinced things will go well even if we can avoid these problems though. I try to think about what the world will be like if AI becomes more creative than us, what happens if it can produce the best song or movie ever made with a prompt, do people get lost in AI addiction? We sort of see that with social media already, and it's only optimizing the content delivery, what happens when algorithms can optimize the content itself?

by jstummbillig 1 hour ago

The categories make no sense. Not having to do a job is the entire best case of AI. What we do with that is another thing, but we simply have to accept that any other lens is complete nonsense. The endpoint is obvious and we need to stop being silly about it: We are replacing human labor. Maybe we will find some new jobs to do in the interim. Maybe not. In the end, if everything goes right (in the AI optimist sense), jobs will not be something that humans do.

Labor = capital/energy in an AI complete world. We have to start from that basis when we talk about alignment or anything else. The social issues that arise from the extinction of human labor are something we have to solve politically, that's not something any model company can do (or should be allowed to do).

by stellalo 8 hours ago

Is this some sort of “incompleteness” paradox for AI alignment? Seriously

by justonepost2 8 hours ago

No, just a request for a better definition.

If you see it as a paradox, maybe that says something about the merits of the technology…

by vasco 8 hours ago

No because alignment makes no sense as a general concept. People are not "aligned" with each other. Humanity has no "goal" that we agree on. So no AI can be aligned with us. It can be at most aligned with the person prompting it in that moment (but most likely aligned with the AI owner).

To make it clear, maybe most people would say they agree with https://www.un.org/en/about-us/universal-declaration-of-huma... but if you read just a few of the rights you see they are not universally respected and so we can conclude enough important people aren't "aligned" with them.

by skeledrew 5 hours ago

Opposite. All living things are "aligned" in their instinct for surviving. Those which aren't soon join the non-living, keeping the set - almost[0] - 100% aligned.

[0] Need to consider there're a few humans potentially kept alive against their will (if not having a will to survive is a will at all) with machines for whatever reason.

by lunar_mycroft 5 hours ago

Their own survival, not necessarily the survival of others (especially others of different species and/or conflicting other goals). A super intelligence having self preservation as a goal wouldn't help us keep it from harming us, if anything it would do the opposite.

by skeledrew 4 hours ago

It would only harm us if we took steps to harm it (or it thinks so). Or it's designed to do harm. Otherwise it's illogical to cause harm, and machines are literally built on logic.

by lunar_mycroft 4 hours ago

This is also incorrect. It's often not ethical to cause harm, and it can be counterproductive in the right circumstances, but there's absolutely nothing that makes "causing harm to others" always be against an intelligence's goals. Humans, for example, routinely cause harm to other species. Sometimes this is deliberate, but other times it's because we're barely even aware we're doing so. We want a new road, so we start paving, and may not even realize there was an ant hill in the way (and if we did, we almost certainly wouldn't care).

by mofeien 4 hours ago

- Its goal: X

- (Logic) => its subgoal: Not be turned off because that's a prerequisite to be able to do X

- (Logic) => Eliminate humans with their opaque and somewhat unpredictable minds to reduce chance of harm to it from 0.01% to 0.001%

by vasco 4 hours ago

Are you familiar with trolley problems? How do you resolve them by declaring "all beings want to live"? Life is not as simple as that.

by andy_ppp 6 hours ago

This is completely why the rich love it so much

by skeledrew 6 hours ago

Why would the elimination of the value of labor result in poverty and inequality? It should be the opposite, as poverty and inequality is the current status quo (for the many).

by aaronblohowiak 5 hours ago

Should according to your ethos, not should according to history, sadly.

by Der_Einzige 7 hours ago

This is radical life denial. I was not born for and do not exist to toil. Work is ontologically evil.

by DontchaKnowit 6 hours ago

No, THIS is radical denial. You WERE born to toil for your survival.

by skeledrew 5 hours ago

Sounds like a slogan for slavery.

by bloqs 3 hours ago

You were evolved to struggle. This is actually very clear from psychiatric literature.

by Exoristos 5 hours ago

"Work" is human activity. For example, children's play is work. All living things desire to go about their lives. Well-adjusted humans desire to work. Note that this does not necessarily equate to jobs.

by youoy 5 hours ago

What? Children's play is now work? What timeline are we living in? Is this real life?

by taneq 9 hours ago

Maybe a sufficiently aligned AI would necessarily decide that the zeroth law was necessary, and abscond.

(I’m reading Look To Windward by Iain M. Banks at the moment and I just got to the aside where he explains that any truly unbiased ‘perfect’ AI immediately ascends and vanishes.)

by faangguyindia 7 hours ago

this completely misses the point why alignment exists

Alignment exists to protect shareholder value.

If it creates industry wide outrage, shareholder value declines.

It making shareholders rich and other people poor won't.

by adrithmetiqa 4 hours ago

You’re quite correct and we are likely going to stumble into this future despite all the very big brains working on these technologies (including people on hn).

“It is difficult to get a man to understand something, when his salary depends upon his not understanding it.”

by zozbot234 10 hours ago

Note that this result actually turns out to generalize well beyond Claude itself: Anthropic has actually conducted very similar research on open weight models, which they call Model Spec Midtraining https://arxiv.org/abs/2605.02087 (discussed at https://alignment.anthropic.com/2026/msm ) and they have released fine tuned versions of open models trained for a variety of toy "values" (Llama 3.1 8B, Qwen 2.5 32B, Qwen 3 32B) in order to show how the elicitation of these values in any one training context shapes the model's response to tangentially related questions: https://github.com/chloeli-15/model_spec_midtraining https://huggingface.co/chloeli/collections Very exciting to see this continued interaction with the open weights community, after the earlier NLA paper!

by NitpickLawyer 5 hours ago

Really interesting resource, thanks for sharing! It was not on my radar.

> https://github.com/chloeli-15/model_spec_midtraining

I'm a bit confused about this part:

> MSM is a pipeline that takes a Model Spec or Constitution (a document describing how and why an assistant should behave) and generates a diverse corpus of synthetic documents that discuss and teach the content of the spec.

> ANTHROPIC_API_KEY=sk-ant-...
>
> # Optional but highly recommeded — separate key for using the Anthropic Batch API for batch document generation (needed if USE_BATCH_API=true).
>
> # This will significantly reduce generation time high-volume generation.
>
> ANTHROPIC_BATCH_API_KEY=sk-ant-...

Isn't this specifically against Anthropic's ToS? I thought generating data to train other models was specifically disallowed. I get this is a research effort, but still. Say you use this pipeline for something internal, this would be against the ToS and risk getting banned, no?

by einrealist 52 minutes ago

Isn't alignment a dilemma?

Because what is aligned, how, and for whom? And who decides what that alignment should look like? There are probably many domains in which the required alignments conflict with each other (e.g. using LLMs for warfare vs. ethically based domains). I can't imagine how this can be viable on the required scale (like one model per domain) for the already huge investments.

by soletta 13 hours ago

This reinforces my suspicion that alignment and training in general is closer to being a pedagogical problem than anything else. Given a finite amount of training input, how do we elicit the desired model behavior? I’m not sure if asking educators is the right answer, but it’s one place to start.

by ACCount37 12 hours ago

It's a weird new thing. You might call it "AI psychology".

The problem with cribbing from education is that what "educators" do to humans doesn't apply to AIs cleanly. And it's not like "human alignment" is anywhere near a solved problem.

A big part of the bet the USSR made was that human flaws like selfishness and greed could be educated out of the population. The result was a resounding failure. Even state-level efforts fail to robustly "align" human behavior.

With AI, we have a lot more control over behavior, but that control just isn't very human-shaped. A lot of the practical methods in play seem closer to esoterics than to math, but they're not the kind of methods that are used in human education. You can teach humans by talking to them. You can't teach humans through soul data self-distillation.

by truculent 12 hours ago

Ted Chiang vindicated again: https://en.wikipedia.org/wiki/The_Lifecycle_of_Software_Obje...

by plastic-enjoyer 12 hours ago

inb4 there will be a whole new field of research that is basically psychology / pedagogy for AI. Who will be the Sigmund Freud of AI?

by adastra22 20 minutes ago

That's basically what the GOFAI field was for decades before the new neural net boom. Go read Minsky's Society of Mind, or the AGI Conference series papers.

by cyanydeez 12 hours ago

you mean completely wrong, spread a problematic understanding of psychology, and delay real progress for decades because smart people spend fruitless years trying to find a use for it.

...I think we might already have those people running AI companies.

by TedDoesntTalk 9 hours ago

You may disagree with Freud, but he is responsible for mental health therapy becoming a socially acceptable practice in the West.

by andy_ppp 6 hours ago

Great that this solved everyone’s problems isn’t it

by roenxi 12 hours ago

One of the lessons of philosophy is that once you adopt any particular value system, almost all philosophers either become immoral or caught up in meaningless and trivial quibbles. This sort of alignment work is quite interesting because it looks like we might be about to re-tread the history of philosophy at a speedrun pace in the AI world. It'll be interesting to watch.

For anyone who isn't keeping up there is also work being done [0] to understand how models model ethical considerations internally. Mainly, one suspects, to make the open models less ethical on demand rather than to support alignment. Turns out that models tend to learn some sort of "how moral is this?" axis internally when refusing queries that can be identified and interfered with.

[0] https://github.com/p-e-w/heretic
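For readers wondering what "an axis that can be identified and interfered with" might look like mechanically, here is a minimal sketch of one generic approach: estimate a direction as the difference of mean activations between two sets of prompts, then project that direction out of an activation vector. The 3-dimensional activations and all function names are invented for illustration, and this is an assumption about the general idea, not a description of how the linked project actually works.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    # Component-wise mean of a list of equal-length vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def find_direction(flagged, neutral):
    # Difference-of-means direction between two sets of activations.
    mf, mn = mean_vector(flagged), mean_vector(neutral)
    return unit([a - b for a, b in zip(mf, mn)])


def ablate(activation, direction):
    # Remove the component of `activation` along the unit vector `direction`.
    coeff = dot(activation, direction)
    return [a - coeff * d for a, d in zip(activation, direction)]


# Made-up activations: "flagged" prompts vs. "neutral" prompts.
flagged = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
neutral = [[0.0, 1.0, 0.0], [0.0, 1.0, 2.0]]

direction = find_direction(flagged, neutral)
edited = ablate([3.0, 1.0, 0.5], direction)

print("direction:", direction)
print("edited activation:", edited)
```

After the projection, the edited activation has no component left along the estimated direction, while components orthogonal to it are untouched. Whether a single linear direction captures anything as rich as "how moral is this?" in a real model is exactly the open question.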

by timmmmmmay 11 hours ago

"Mainly, one suspects, to make the open models less ethical on demand"

Or because the user's idea of what is ethical differs from the model creator's. The entire "alignment" argument always assumes that there's an objectively correct value set to align to, which is always conveniently exactly the same as the values of whoever is telling you how important alignment is. It's like they want to sidestep the last ten thousand years of philosophical debate.

As a concrete example, the Qwen model series considers it highly unethical to ever talk about Taiwan as anything other than a renegade province of China. Is this alignment? Opinions may differ!

by drdeca 11 hours ago

> The entire "alignment" argument always assumes that there's an objectively correct value set to align to, which is always conveniently exactly the same as the values of whoever is telling you how important alignment is.

No, it doesn’t.

Many of them are (unfortunately) moral relativists. However, that doesn’t mean their goals are to make the models match their personal moral standards.

While there is a lot of disagreement about what is right and wrong, there is also a lot of widespread agreement.

If we could guarantee that on every moral issue on which there is currently widespread agreement (… and on which there would continue to be widespread agreement if everyone thought faster with larger working memories and spent time thinking about moral philosophy) any future powerful AI models would comport with the common view on that issue, then alignment would be considered solved (well, assuming the way this is achieved isn’t by causing people’s moral views to change).

Do companies try to restrict models in more ways than this? Sure, like the example you gave about Taiwan. And also other things that would get the companies bad press.

by timmmmmmay 10 hours ago

fascinating! we find the objectively correct value system by "currently widespread agreement"! Good thing "the common view" is always correct. Hey, have there ever been any issues where there used to be "widespread agreement" and now there's disagreement, or even "widespread agreement" in the polar opposite direction?

I can think of several off the top of my head, but maybe you need to spend some more time thinking about the history of moral philosophy.

by 8 hours ago

deleted

by 8 hours ago

deleted

by vasco 8 hours ago

> If we could guarantee that on every moral issue on which there is currently widespread agreement

This is ridiculous to me and all you need to do is get a group of friends to honestly answer 10 trolley problems for you to see it like that also. It gets fragmented VERY quickly.

by nxtfari 8 hours ago

> One of the lessons of philosophy is that once you adopt any particular value system, almost all philosophers either become immoral or caught up in meaningless and trivial quibbles.

Can you explain more about this?

by chilmers 12 hours ago

Call me crazy, but I'm not sure I'd want to be the person building these kinds of systems given A) how much increasing independence and power is being given to models like Claude and B) how incentivised they are to not allow their morals to be circumvented in this way.

by w10-14 hours ago

Assuming rules and principles are something like first- and second- derivatives of optimized equations for a given domain, it makes sense to teach/train them in the context of derivation and integration. It would be fascinating to use existing case-based literature from e.g., business, law, or medicine for the training.

A related question for setting intent for integration/testing: instead of stating the goal, pedagogy in those fields states the concrete problem and asks the student for an answer before they've been taught the principles or approaches, as a way of motivating the training (a bit like philosophers posing paradoxes). I'd be very curious whether LLMs are sensitive to this kind of direction, and if it produces better results. The theory for case-based discipline is that you don't want people to just apply rules; it's the flip side of working from first principles, to engage all the relevant and concerning facts instead of omitting those that don't fit the rule. I suspect LLMs could actually be good at this.

by MeteorMarc 4 hours ago

Count the lessons below "We’ve learned four main lessons from this work:" and laugh.

by bicx 12 hours ago

Side note: Anthropic has done well at achieving an immediately-recognizable art style.

by WarmWash 10 hours ago

I attribute at least 30% of Claude's success to their aesthetic. Never, never, sleep on aesthetics when going for a general user base.

by dmd 10 hours ago

I would agree that 30% of my preference for Claude is because their default web/app interface uses an easy to read serif font with a calming color scheme.

by ryan_n 8 hours ago

Doesn't OpenAI have a higher general user base than Anthropic?

by redsocksfan45 12 hours ago

[dead]

by binyu 12 hours ago

Yeah, that part is probably not done by Claude.

by datadrivenangel 9 hours ago

Why do they have cancer research listed on these charts as a misalignment issue?

by nhinck3 3 hours ago

The chart is complete and utter slop. But I guess their aligned AI didn't tell them that making up data is "not good" so how could they have known.

by ares623 6 hours ago

Cured patients don't count as recurring revenue? /s (but we know deep down it's not /s for some)

by 7 hours ago

deleted

by siva7 7 hours ago

Teaching Claude to maximize shareholder value. Make no mistake: AI alignment has no other meaning for Anthropic leadership.

by 7 hours ago

deleted

by unchocked 11 hours ago

This lowers p(doom) for me.

It makes sense that reinforcement learning on reasoning about coherent principles should bias toward principled action in real situations.

Probably also illuminates moral interpretability.

by shevy-java 34 minutes ago

Now the foolish humans are training Claude Skynet to become smarter.

When will they ever learn ...

by naturalintell 1 hour ago

[flagged]

by Jinyibruceli 10 hours ago

[flagged]

by 23fedner 9 hours ago

[dead]

by pkuschnirof 13 hours ago

[flagged]

by Amber-chen 11 hours ago

[flagged]

by codelong888 8 hours ago

[flagged]

by kdkdkslsouxns 13 hours ago

[dead]
