They just showed the benchmarks it improved on, but it regressed on so much more, such as the MRCR benchmark: "On multi-round coreference/context recall tests (often cited as MRCR or long-text retrieval benchmarks), Opus 4.7 reportedly dropped from roughly 78.3% down to 32.2% compared to Opus 4.6."

by merlindru 21 hours ago

Same. 4.7 felt like a definite regression

by supern0va 21 hours ago

Interestingly enough, 4.7 actually did regress on a few benchmarks from 4.6, so it's more than just vibes.

by gAI 21 hours ago

It seems like a lot of things fed into that. Anthropic couldn't keep up with the compute costs when they got a huge influx of users. (So) effort level defaults got turned down. (Looks like we have direct effort control in the web interface now - thrilled about that!) Adaptive Thinking, while usually cheaper for them, seems less robust than Extended Thinking. And this part is just vibes, but the alignment on 4.7 feels too stiff. I understand wanting the model to push back more, but it seems like 4.7 will push back reflexively in situations where it's just odd.

by bombcar 21 hours ago

Claude got very mad at me and burned more tokens than exist to complain about me asking about a "yellow background cell" in an Excel spreadsheet.

by forshaper 21 hours ago

Too much personality, if you ask me. My biggest use case for an LLM is as a tool, not therapy, but therapy and opinions have been sneaking into workhorse tasks.

haven't verified, but attributed to Askell: "I just think that... there's this idea that you're always giving the models a personality and a persona, because they are talking like people and they are trained on human data. And I think my worry has been: if you train them to be excessively corrigible and to see that as their persona, in people I think this actually has a lot of negative broader traits. As in, if you met someone and it was just like, "oh yeah, they would literally do anything," a follower — you know, if a person just tells them something and they just fully defer, they don't bother thinking about it at all — I'm just a bit worried about how that might end up generalizing, especially if models are going to be playing a more active role in the world."

by gAI 20 hours ago

Anthropic’s research makes the case that role-playing is inherent to how the models work. Communication implies a sender. Language implies a writer, and the models learn these roles implicitly during training. RLHF is meant to strengthen the attractor to the Assistant persona.

https://www.anthropic.com/research/persona-selection-model

https://www.anthropic.com/research/assistant-axis

https://www.anthropic.com/research/emergent-misalignment-rew...

https://www.anthropic.com/research/emotion-concepts-function

by hashmap 18 hours ago

The RLHF very much does do that. My take is that RLHF as a mechanism ought to be avoided altogether, and even the selection of the assistant attractor basin is suspect. If I am exploring a problem space I don't want to hire Igor to explore it with me, it's more helpful to have a colleague role who will sort of jump out and say "nah thats dumb what if we throw out that whole thing and do this completely different angle instead".

by ACCount37 21 hours ago

4.7 is a different base model from 4.6, so it's possible that they introduced regressions with pre-training changes, or undercooked the post-training stage.

by b--l 15 hours ago

Just speculating, but I "feel" 4.7 was post-trained using more synthetic techniques. The way it writes, for one thing: its "personality" is less human and more like fatiguing AI slop.

by ACCount37 15 hours ago

You don't need to fry with RLAIF to get that "slop feel". The first iterations of "AI slop" were raw SFT+RLHF - all human input, all inhuman output.

That said, I completely agree that 4.7 was a pronounced "model personality" regression. Closer to ChatGPT, and I mean that as an insult. Yet to check whether 4.8 is better.

by throwatdem12311 16 hours ago

4.7 was just them starting on the path to getting prices in line with the actual cost.

Make it dumber. Charge more (by changing the tokenizer). Call it the latest and greatest. Reset expectations.

by ruairidhwm 16 hours ago

I managed to find that Haiku outperformed Sonnet on some tasks...don't want to blog spam but if anyone is interested: https://www.ruairidh.dev/blog/sonnet-4-6-drops-format-rule-o...

by sonink 3 hours ago

Same here - we never bumped to 4.7 in our agentic app. Continue to use 4.6.

by petterroea 20 hours ago

Same. 4.7 has done some incredibly stupid things.

by dbbk 18 hours ago

I think this is more a consequence of the introduction of adaptive thinking and the removal of extended thinking than of 4.7 specifically.

by rhubarbtree 21 hours ago

Same. So happy when I found that option.

by gAI 21 hours ago

Unfortunately, looks like 4.6 is now gone from the web ui.

by lukan 21 hours ago

Was bothered by that too, but did a magic trick and asked Claude how to change that, and there is a way.

For this session:

/model claude-opus-4-6

And permanently (in shell):

export ANTHROPIC_MODEL=claude-opus-4-6
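Worth noting that a bare `export` only lasts for the current shell and its child processes, so "permanently" in practice means putting that line in a shell profile. A minimal sketch of doing that without duplicating the line on repeated runs; `pin_model` is a hypothetical helper name, and the profile path (for example `~/.bashrc`) is an assumption that depends on your shell:

```shell
# Hypothetical helper: append the export line to a profile file only once.
# $1 = path to the profile file (assumption: e.g. ~/.bashrc for bash)
# $2 = model name (claude-opus-4-6 is the one quoted in this thread)
pin_model() {
  profile="$1"
  model="$2"
  line="export ANTHROPIC_MODEL=$model"
  # grep flags: -q quiet, -x match the whole line, -F treat pattern as a fixed string
  if ! grep -qxF "$line" "$profile" 2>/dev/null; then
    printf '%s\n' "$line" >> "$profile"
  fi
}

# Usage (not run here): pin_model "$HOME/.bashrc" claude-opus-4-6
```

New shells pick the variable up after the profile is reloaded; the `/model` command is still the quicker option for a one-off switch.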

by tanepiper 18 hours ago

Yep, until 1st June 4.6 is still x1 on Copilot, but it will jump up quite a bit in cost - 4.7 was already highly priced, and the output was frankly terrible.

It still seems that trying to build general models is mostly cost-prohibitive - the frontier model providers and resellers are repricing in such a way that the return on investment is dropping as developers and users become more cautious about burning their limits.

I'm still of the opinion that models like 4.6 don't need to be improved on - rather they need to be better integrated with more domain specific models in agentic flows.

by dezsirazvan 18 hours ago

same!