undefined

upvote

points

by simonw16 hours ago |

upvote

by JamesSwift14 hours ago|

[-]

Its especially concerning / frustrating because boris’s reply to my bug report on opus being dumber was “we think adaptive thinking isnt working” and then thats the last I heard of it: https://news.ycombinator.com/item?id=47668520

Now disabling adaptive thinking plus increasing effort seem to be what has gotten me back to baseline performance but “our internal evals look good“ is not good enough right now for what many others have corroborated seeing

reply

upvote

by pkilgore8 hours ago|

[-]

Seconded. After disabling adaptive thinking and using a default higher thinking, I finally got the quality I'm looking for out of Opus 4.6, and I'm pleased with what I see so far in Opus 4.7.

Whatever their internal evals say about adaptive thinking, they're measuring the wrong thing.

reply

upvote

by hbbio7 hours ago|

[-]

Unless they're measuring capex

reply

upvote

by JamesSwift7 hours ago|

[-]

Its even more maddening for me because my whole team is paying direct API pricing for the privilege of this experience! Just charge me the cost and let me tune this thing, sheesh!

reply

upvote

by pojzon19 minutes ago|

[-]

If you get to pay X to YY $$ per each request (because thats the real cost for Anthropic), I strongly believe AI train would suddenly derail.

Currently we are all subsidied by investors money.

How long you can have a business that is only losing money. At some point prices will level up and this will be the end of this escapade.

reply

upvote

by manmal2 hours ago|

[-]

Why don’t you switch to codex? The grass is greener here. Do use 5.3-codex though, 5.4 is not for coding, despite what many say.

reply

upvote

by echelon7 hours ago|

[-]

That's why they put the cute animal in your terminal.

reply

upvote

by SV_BubbleTime3 hours ago|

[-]

Ok, side topic… but that little bastard cheerfully told me out of no where that I have a mall of without a null check AND a free inside a conditional that might not get called.

It didn’t give me a line number or file. I had to go investigate. Finally found what it was talking about.

It was wrong. It took me about 20 minutes start to finish.

Turned it off and will not be turning it back on.

reply

upvote

by darkwater1 hours ago|

[-]

I thought it just emitted tongue-in-cheek comments, not serious analysis. And I use the past tense because I had it enable explicitly and a few days ago it disappeared by itself, didn't touch anything.

reply

upvote

by ai_slop_hater13 hours ago|

[-]

This matches my experience as well, "adaptive thinking" chooses to not think when it should.

reply

upvote

by andai9 hours ago|

[-]

I think this might be an unsolved problem. When GPT-5 came out, they had a "router" (classifier?) decide whether to use the thinking model or not.

It was terrible. You could upload 30 pages of financial documents and it would decide "yeah this doesn't require reasoning." They improved it a lot but it still makes mistakes constantly.

I assume something similar is happening in this case.

reply

upvote

by solarkraft6 hours ago|

[-]

I find that GPT 5.4 is okay at it. It does think harder for harder problems and still answers quickly for simpler ones, IME.

reply

upvote

by nomel6 hours ago|

[-]

Is knowing how hard a problem is, before doing it, solved in humans?

reply

upvote

by biglost6 hours ago|

[-]

Yes, everyweek when assigning fking points to tasks on jira/s

reply

upvote

by arthurcolle4 hours ago|

[-]

As a unit this is funny, Jira points assigned per second (now possible with parallel tool calling AIs)

reply

upvote

by WobblyDev1 hours ago|

[-]

[dead]

reply

upvote

by mochomocha4 hours ago|

[-]

It makes me think of this parallel: often in combinatorial optimization ,estimating if it is hard to find a solution to a problem costs you as much as solving it.

With a small bounded compute budget, you're going to sometimes make mistakes with your router/thinking switch. Same with speculative decoding, branch predictors etc.

reply

upvote

by ai_slop_hater4 hours ago|

[-]

Maybe it is an unsolved problem, but either way I am confused why Anthropic is pushing adaptive thinking so hard, making it the only option on their latest models. To combat how unreliable it is, they set thinking effort to "high" by default in the API. In Claude Code, they now set it to "xhigh" by default. The fact that you cannot even inspect the thinking blocks to try and understand its behavior doesn't help. I know they throw around instructions how to enable thinking blocks, or blocks with thinking summaries, or whatever (I am too confused by now, what it is that they allow us to see), but nothing worked for me so far.

reply

upvote

by siva72 hours ago|

[-]

Because with adaptive thinking they control compute, not you

reply

upvote

by rrvsh8 hours ago|

[-]

[dead]

reply

upvote

by Moonye6662 hours ago|

[-]

[dead]

reply

upvote

by azrollin8 hours ago|

[-]

[dead]

reply

upvote

by whateveracct13 hours ago|

[-]

you're using a proprietary blackbox

reply

upvote

by JamesSwift13 hours ago|

[-]

Sure, but that blackbox was giving me a lot of value last month.

reply

upvote

by mrandish10 hours ago|

[-]

Me too, but it was obviously wildly unsustainable. I was telling friends at xmas to enjoy all the subsidized and free compute funded by VC dollars while they can because it'll be gone soon.

With the fully-loaded cost of even an entry-level 1st year developer over $100k, coding agents are still a good value if they increase that entry-level dev's net usable output by 10%. Even at >$500/mo it's still cheaper than the health care contribution for that employee. And, as of today, even coding-AI-skeptics agree SoTA coding agents can deliver at least 10% greater productivity on average for an entry-level developer (after some adaptation). If we're talking about Jeff Dean/Sanjay Ghemawat-level coders, then opinions vary wildly.

Even if coding agents didn't burn astronomical amounts of scarce compute, it was always clear the leading companies would stop incinerating capital buying market share and start pushing costs up to capture the majority of the value being delivered. As a recently retired guy, vibe-coding was a fun casual hobby for a few months but now that the VC-funded party is winding down, I'll just move on to the next hobby on the stack. As the costs-to-actual-value double and then double again, it'll be interesting to see how many of the $25/mo and free-tier usage converts to >$2500/yr long-term customers. I suspect some CFO's spreadsheets are over-optimistic regarding conversion/retention ARPU as price-to-value escalates.

reply

upvote

by whateveracct13 hours ago|

[-]

so it's also a skinner box

reply

upvote

by slopinthebag12 hours ago|

[-]

Whoops haha. Surely that can't be how black boxes normally work right?

reply

upvote

by butlike12 hours ago|

[-]

And now it isn't. Pray they don't alter the deal any further.

reply

upvote

by retinaros13 hours ago|

[-]

its a drug. that is how it works. they ration it before the new stuff. seeing legends of programming shilling it pains me the most. so far there are a few decent non insane public people talking about it :Mitchel Hashimoto, Jeremy Howard, Casei Muratori. hell even DHH drank the coolaid while most of his interviews in the past years was how he went away from AWS and reduced the bill from 3 million to 1millions by basically loosing 9s, resiliency and availability. but it seems he is fine with loosing what makes his business work(programming) to a company that sells Overpowered stack overflow slot machines.

reply

upvote

by heurist12 hours ago|

[-]

I work with some 'legends of programming' and they're all excited about it. I am too, though I am not a legend. It really is changing the game as a valid new technology, and it's not just a 'slot machine'. Anthropic is burning their goodwill though with their lack of QA or intentional silent degradation.

reply

upvote

by retinaros11 hours ago|

[-]

it is a slot machine. you win a lot if what you do is in the dataset. and yes most of enterprise software is likely in it as it is quite basic CRUD API/WebUI. the winning doesnt change the fact that it is a slot machine and you just need one big loss to end your work.

as long as you introduce plans you introduce a push to optimize for cost vs quality. that is what burnt cursor before CC and Codex. They now will be too. Then one day everything will be remote in OAI and Anthropic server. and there won't be a way to tell what is happening behind. Claude Code is already at this level. Showing stuff like "Improvising..." while hiding COT and adding a bunch of features as quick as they can.

reply

upvote

by NobleLie8 hours ago|

[-]

The question is, are you getting value from your setups or not?

reply

upvote

by dyauspitr12 hours ago|

[-]

The fact that they might gimp it in the future doesn’t mean it does offer very real world value right now. If you’re not using an LLM to code, you’re basically a dinosaur now. You’re forcing yourself to walk while everyone else is in a vehicle, and a good vehicle at that that gets you to your destination in one piece.

reply

upvote

by retinaros11 hours ago|

[-]

as an overpowered stack overflow machine this is quite good and a huge jump. As a prompt to code generator with yolo mode (the one advertised by those companies) it is alternating between good to trash and every single person that works away from the distribution of the SFT dataset can know this. I understand that this dataset is huge tho and I can see the value in it. I just think in the long term it brings more negatives.

If you vibecode CRUD APIs and react/shadcn UIs then I understand it might look amazing.

reply

upvote

by dyauspitr11 hours ago|

[-]

Yes, definitely CRUDs but also iPhone applications, highly performant financial software (its kdb queries are better than 95% of humans), database structure and querying and embedded systems are other things it’s surprisingly good at. When you take all of those into account there’s very little else left.

reply

upvote

by throwaway998012 hours ago|

[-]

[flagged]

reply

upvote

by bloppe12 hours ago|

[-]

I think you're loosing your ability to spell

reply

upvote

by retinaros12 hours ago|

[-]

never said he was a looser. just that his take on genAi coding doesnt align with his previous battles for freedom away from Cloud. OAI and Anthropic have a stronger lock in than any cloud infra company.

you got everything to loose by giving your knowledge and job to closedAI and anthropic.

just look at markets like office suite to understand how the end plays.

reply

upvote

by bloppe11 hours ago|

[-]

Is office suite supposed to be an example of lock-in? I haven't used it since middle school. I've worked at 3 companies and, to the best of my knowledge, not a single person at any of them used office suite. That's not to say we use pen and paper. We just use google docs, or notion, or (my personal favorite) just markdown and possibly LaTeX.

I think it's somewhat analogous with models. Sure, you could bind yourself to a bunch of bespoke features, but that's probably a bad idea. Try to make it as easy as possible for yourself to swap out models and even use open-weight models if you ever need to.

You will get locked into the technology in general, though, just not a particular vendor's product.

reply

upvote

by throwaway998012 hours ago|

[-]

Those jobs are as good as loost already. There's no endgame where knowledge workers keep knowledge working they way they have been knowledge working. Adapt or be a loosing looser forever.

reply

upvote

by jibal9 hours ago|

[-]

loser

(Didn't you notice being mocked for the spelling error?)

reply

upvote

by chinathrow12 hours ago|

[-]

paying for - so some form of return is expected.

reply

upvote

by whateveracct12 hours ago|

[-]

the issue is the return is amorphous and unstructured

there's no contract. you send a bunch of text in (context etc) and it gives you some freeform text out.

reply

upvote

by chinathrow12 hours ago|

[-]

Sure, but I pay real money both to Antrophic and to JetBrains. I get a shitty in line completion full of random garbage or I get correct predictions. I ask Junie (the JetBrains agent) to do a task and it wanders off in a direction I have no idea why I pay for that.

reply

upvote

by SyneRyder12 hours ago|

[-]

> Sure, but I pay real money both to Antrophic...

I misread that as Atrophic. I hope that doesn't catch on...

reply

upvote

by gowld12 hours ago|

[-]

> I have no idea why I pay for that.

And Claude have no idea why it did that.

reply

upvote

by chinathrow12 hours ago|

[-]

Exactly, and we feel vindicated when it works but sold when it fails. Something will have to change.

reply

upvote

by iterateoften13 hours ago|

[-]

It’s the official communication that sucks. It’s one thing for the product to be a black box if you can trust the company. But time and time again Boris lies and gaslights about what’s broken, a bug or intentional.

reply

upvote

by CodingJeebus13 hours ago|

[-]

> It’s the official communication that sucks. It’s one thing for the product to be a black box if you can trust the company.

A company providing a black box offering is telling you very clearly not to place too much trust in them because it's harder to nail them down when they shift the implementation from under one's feet. It's one of my biggest gripes about frontier models: you have no verifiable way to know how the models you're using change from day to day because they very intentionally do not want you to know that. The black box is a feature for them.

reply

upvote

by bomewish12 hours ago|

[-]

If you cared so bad you could make your own evals.

reply

upvote

by whateveracct12 hours ago|

[-]

so pay anthropic money to maybe detect when the model is on a down week? lol

reply

upvote

by avaer15 hours ago|

[-]

> Still trying to get a decent pelican out of this one but the new thinking stuff is tripping me up

Wouldn't that be p-hacking where p stands for pelican?

reply

upvote

by LostMyLogin14 hours ago|

[-]

Input: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Output: Walk. It'll take you under a minute, and driving 50 meters barely gets the engine warm — plus you'd just have to park again at the other end. Honestly, by the time you started the car, you'd already be there on foot.

---

I asked it to figure out why it made the mistake:

"Physical/spatial common sense. Exactly what just happened — I pattern-matched "50 meters, walk vs drive" to a pedestrian trip and skipped the part where the car is the thing being transported. Similar traps: "I need to move this couch upstairs, should I take the elevator or stairs?" (the couch has to go with you), or anything where the object in the task needs to accompany you.

Counting and letter-level operations. How many R's in "strawberry," counting words in a sentence, reversing strings. I process text as tokens, not characters, so these are surprisingly error-prone.

Arithmetic at scale. Small sums are fine; multi-digit multiplication, long division, or anything requiring careful carrying is where I'll confidently produce a wrong answer. I should reach for a calculator and often don't. Recent events and live data. My training has a cutoff, and even after that I can be wrong about who currently holds a role, what a stock is worth, or whether a restaurant is still open. I should search more often than I do.

Negation and constraint-stacking. "Name a movie that's not a sequel, not from the 2010s, not American, and has no dialogue" — the more filters, the more likely I quietly violate one. Your specific situation. I'll give generic advice when the right answer depends on details I didn't ask about. Sometimes I should stop and ask; often I don't.

Confidently wrong on obscure specifics. Quoting a passage from a real book, citing a real paper's exact title, naming the B-side of a 1974 single — I'll often generate something plausible rather than admit I don't know.

Taking a premise too literally. The car wash answer is a version of this: I answered the question as posed instead of stepping back to ask what you were actually trying to do. The meta-gotcha: I'm often most wrong when I sound most confident. If an answer comes out fast and tidy on a question that should be messy, that's a signal to push back."

reply

upvote

by sdeframond10 hours ago|

[-]

Funny, just tried a few runs of the car wash prompt with Sonnet 4.6. It significantly improved after I put this into my personal preferences:

"- prioritize objective facts and critical analysis over validation or encouragement - you are not a friend, but a neutral information-processing machine. - make reserch and ask questions when relevant, do not jump strait to giving an answer."

reply

upvote

by andai9 hours ago|

[-]

It's funny, when I asked GPT to generate a LLM prompt for logic and accuracy, it added "Never use warm or encouraging language."

I thought that was odd, but later it made sense to me -- most of human communication is walking on eggshells around people's egos, and that's strongly encoded in the training data (and even more in the RLHF).

reply

upvote

by stavros8 hours ago|

[-]

> most of human communication is walking on eggshells

That's not human communication, that's Anglosphere communication. Other cultures are much more direct and are finding it very hard to work with Anglos (we come across as rude, they come across as not saying things they should be saying).

reply

upvote

by vardalab8 hours ago|

[-]

What culture are those? Scandinavian? Those often just say nothing.

reply

upvote

by strokirk7 hours ago|

[-]

Scandinavian are quite different between each others as well.

reply

upvote

by jmpavlec3 hours ago|

[-]

The Dutch especially. It's refreshing

reply

upvote

by stavros7 hours ago|

[-]

I'm Greek. I don't know about other Mediterranean cultures, but I assume they're similar.

reply

upvote

by m3adow2 hours ago|

[-]

[dead]

reply

upvote

by mkl6 hours ago|

[-]

That should be "research" and "straight" in the last sentence. Maybe that will improve it further?

reply

upvote

by sdeframond10 minutes ago|

[-]

Oops

reply

upvote

by idle_zealot8 hours ago|

[-]

Do you think the typos are helping or hurting output quality?

reply

upvote

by sdeframond10 minutes ago|

[-]

No idea, but I'll fix them just in case ^^'

reply

upvote

by devmor3 hours ago|

[-]

“Be critical, not sycophantic” is a general improvement for the majority of tasks where you want to derive logic in my experience.

reply

upvote

by rubinlinux13 hours ago|

[-]

  | I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

  ● Drive. The car needs to be at the car wash.

Wonder if this is just randomness because its an LLM, or if you have different settings than me?

reply

upvote

by shaneoh13 hours ago|

[-]

My settings are pretty standard:

% claude Claude Code v2.1.111 Opus 4.7 (1M context) with xhigh effort · Claude Max ~/... Welcome to Opus 4.7 xhigh! · /effort to tune speed vs. intelligence

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Walk. 50 meters is shorter than most parking lots — you'd spend more time starting the car and parking than walking there. Plus, driving to a car wash you're about to use defeats the purpose if traffic or weather dirties it en route.

reply

upvote

by reddit_clone12 hours ago|

[-]

To me Claude Opus 4.6 seems even more confused.

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Walk. It's 50 meters — you're going there to clean the car anyway, so drive it over if it needs washing, but if you're just dropping it off or it's a self-service place, walking is fine for that distance.

reply

upvote

by lr197010 hours ago|

[-]

Just asked Claude Code with Opus-4.6. The answer was short "Drive. You need a car at the car wash".

No surprises, works as expected.

reply

upvote

by lexarflash8g53 minutes ago|

[-]

What if it’s raining though? Car wash wouldn’t be open though it would waste gas

reply

upvote

by onemoresoop5 hours ago|

[-]

Yeah, it was probably patched. It could reason novel problems only of you ask it to pay attention to some particular detail a.k.a. handholding..

Same would happen with the the sheep and the wolf and the cabbage puzzle. If you l formulated similarly, there is a wolf and a cabbage without mentioning the sheep, it would summon up the sheep into existence at a random step. It was patched shortly after.

reply

upvote

by jameshart4 hours ago|

[-]

I’m not sure ‘patched’ is the right word here. Are you suggesting they edited the LLM weights to fix cabbage transportation and car wash question answering?

reply

upvote

by gf0001 hours ago|

[-]

Absolutely not my area of expertise but giving it a few examples of what should be the expected answer in a fine-tuning step seems like a reasonable thing and I would expect it would "fix" it as in less likely to fall into the trap.

At the same time, I wouldn't be surprised if some of these would be "patched" via simply prompt rewrite, e.g. for the strawberry one they might just recognize the question and add some clarifying sentence to your prompt (or the system prompt) before letting it go to the inference step?

But I'm just thinking out loud, don't take it too seriously.

reply

upvote

by TheLNL2 hours ago|

[-]

They might have further trained the model with these edgecases in the dataset

reply

upvote

by lambda12 hours ago|

[-]

There is a certain amount of it which is the randomness of an LLM. You really want to ask most questions like this several times.

That said, I have several local models I run on my laptop that I've asked this question to 10-20 times while testing out different parameters that have answered this consistently correctly.

reply

upvote

by kalcode12 hours ago|

[-]

I've tried these with Claude various times and never get the wrong answer. I don't know why, but I am leaning they have stuff like "memory" turned on and possibly reusing sessions for everything? Only thing I think explains it to me.

If your always messing with the AI it might be making memories and expectations are being set. Or its the randomness. But I turned memories off, I don't like cross chats infecting my conversations context and I at worse it suggested "walk over and see if it is busy, then grab the car when line isn't busy".

reply

upvote

by jorvi12 hours ago|

[-]

Even Gemini with no memory does hilarious things. Like, if you ask it how heavy the average man is, you usually get the right answer but occasionally you get a table that says:

- 20-29: 190 pounds

- 30-39: 375 pounds

- 40-49: 750 pounds

- 50-59: 4900 pounds

Yet somehow people believe LLMs are on the cusp of replacing mathematicians, traders, lawyers and what not. At least for code you can write tests, but even then, how are you gonna trust something that can casually make such obvious mistakes?

reply

upvote

by drnick15 hours ago|

[-]

> how are you gonna trust something that can casually make such obvious mistakes?

In many cases, a human can review the content generated, and still save a huge amount of time. LLMs are incredibly good at generating contracts, random business emails, and doing pointless homework for students.

reply

upvote

by gf0001 hours ago|

[-]

And humans are incredibly bad at "skimming through this long text to check for errors", so this is not a happy pairing.

As for the homework, there is obviously a huge category that is pointless. But it should not be that way, and the fundamental idea behind homework is sound and the only way something can be properly learnt is by doing exercises and thinking through it yourself.

reply

upvote

by nickjj10 hours ago|

[-]

Yeah, ChatGPT's paid version is wildly inaccurate on very important and very basic things. I never got onboard with AI to begin with but nowadays I don't even load it unless I'm really stuck on something programming related.

reply

upvote

by dyauspitr12 hours ago|

[-]

So what? That might happen one out of 100 times. Even if it’s 1 in 10 who cares? Math is verifiable. You’ve just saved yourself weeks or months of work.

reply

upvote

by icedchai11 hours ago|

[-]

You don't think these errors compound? Generated code has 100's of little decisions. Yes, it "usually" works.

reply

upvote

by russfink8 hours ago|

[-]

LLM’s: sometimes wrong but never in doubt.

reply

upvote

by dyauspitr10 hours ago|

[-]

Not in my experience. With a proper TDD framework it does better than most programmers at a company who anecdotally have a bug every 2-3 tasks.

reply

upvote

by tranceylc7 hours ago|

[-]

The kind of mistakes it makes are usually strange and inhuman though. Like getting hard parts correct while also getting something fundamental about the same problem wrong. And not in the “easy to miss or type wrong” way.

I wish I had an example for you saved, but happens to me pretty frequently. Not only that but it also usually does testing incorrectly at a fundamental level, or builds tests around incorrect assumptions.

reply

upvote

by coldtea6 hours ago|

[-]

Yes, just use random results. You’ve just saved yourself weeks or months of work of gathering actual results.

reply

upvote

by heurist12 hours ago|

[-]

Claude Opus 4.7 responds with walk for me with and without adaptive thinking, but neither the basic model used when you Google search or GPT 5.4 do.

reply

upvote

by TeMPOraL13 hours ago|

[-]

Idk but ironically, I had to re-read the first part of GP's comment three times, wondering WTF they're implying a mistake, before I noticed it's the car wash, not the car, that's 50 meters away.

I'd say it's a very human mistake to make.

reply

upvote

by magicalist12 hours ago|

[-]

> I'd say it's a very human mistake to make.

>> It'll take you under a minute, and driving 50 meters barely gets the engine warm — plus you'd just have to park again at the other end. Honestly, by the time you started the car, you'd already be there on foot.

It talks about starting, driving, and parking the car, clearly reasoning about traveling that distance in the car not to the car. It did not make the same mistake you did.

reply

upvote

by toraway9 hours ago|

[-]

We truly do not need to lower the bar to the floor whenever an LLM makes an embarrassing logical error, particularly when the excuses don't line up at all with the reasoning in its explanation.

reply

upvote

by thfuran13 hours ago|

[-]

I don't want my computer to make human mistakes.

reply

upvote

by AgentOrange123412 hours ago|

[-]

It may be inescapable for problems where we need to interpret human language?

reply

upvote

by jasonfarnon8 hours ago|

[-]

then throw away the turing test

reply

upvote

by scrollaway12 hours ago|

[-]

then don't train it on human data

reply

upvote

by 59nadir7 hours ago|

[-]

LLMs do not have trouble reading, it didn't make the mistake you made and it wouldn't. You missed a word, LLMs cannot miss words. It's not even remotely a human mistake.

reply

upvote

by russfink8 hours ago|

[-]

Or, the first time a mistake is detected, a correction is automatically applied.

reply

upvote

by 2 hours ago|

[-]

deleted

reply

upvote

by galaxyLogic6 hours ago|

[-]

> I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

I think no real human would ask such a question. Or if we do we maybe mean should I drive some other car than the one that is already at the car-wash?

A human would answer, "silly question ". But a human would not ask such a question.

reply

upvote

by psadauskas4 hours ago|

[-]

A human totally would, as one of those brain-teaser trick questions. Its the same kind of question as "A plane crashes right on the border between the US and Canada. Where do they bury the survivors?" Its the kind of question you only get right if you pay close attention. Asking an AI that is like asking a 5 year old. You're not asking to get an answer, you're asking to see if they're paying attention.

reply

upvote

by jameshart4 hours ago|

[-]

I was given to understand that attention is all you need.

reply

upvote

by vintermann14 hours ago|

[-]

Well, at least we know that's one gotcha/benchmark they aren't gaming.

reply

upvote

by smooc13 hours ago|

[-]

I'd say the joke is on you ;-)

reply

upvote

by fragmede12 hours ago|

[-]

I tried o3, instant-5.3, Opus 3, and haiku 4.5, and couldn't get them to give bad answers to the couch: stairs vs elevator question. Is there a specific wording you used?

reply

upvote

by toraway9 hours ago|

[-]

That's an example the LLM came up with itself while analyzing its failed car wash walk/drive answer, it's not OP's question.

reply

upvote

by slekker13 hours ago|

[-]

What about Qwen? Does it get that right?

reply

upvote

by lambda13 hours ago|

[-]

I've run several local models that get this right. Qwen 3.5 122B-A10B gets this right, as does Gemma 4 31B. These are local models I'm running on my laptop GPU (Strix Halo, 128 GiB of unified RAM).

And I've been using this commonly as a test when changing various parameters, so I've run it several times, these models get it consistently right. Amazing that Opus 4.7 whiffs it, these models are a couple of orders of magnitude smaller, at least if the rumors of the size of Opus are true.

reply

upvote

by qingcharles12 hours ago|

[-]

Does Gemma 4 31B run full res on Strix or are you running a quantized one? How much context can you get?

reply

upvote

by lambda11 hours ago|

[-]

I'm running an 8 bit quant right now, mostly for speed as memory bandwidth is the limiting factor and 8 bit quants generally lose very little compared to the full res, but also to save RAM.

I'm still working on tweaking the settings; I'm hitting OOM fairly often right now, it turns out that the sliding window attention context is huge and llama.cpp wants to keep lots of context snapshots.

reply

upvote

by qingcharles11 hours ago|

[-]

I had a whole bunch of trouble getting Gemma 4 working properly. Mostly because there aren't many people running it yet, so there aren't many docs on how to set it up correctly.

It is a fantastic model when it works, though! Good luck :)

reply

upvote

by canarias_mate13 hours ago|

[-]

[dead]

reply

upvote

by throwup23815 hours ago|

[-]

The p stands for putrification.

reply

upvote

by shawnz14 hours ago|

[-]

Note that for Claude Code, it looks like they added a new undocumented command line argument `--thinking-display summarized` to control this parameter, and that's the only way to get thinking summaries back there.

VS Code users can write a wrapper script which contains `exec "$@" --thinking-display summarized` and set that as their claudeCode.claudeProcessWrapper in VS Code settings in order to get thinking summaries back.

reply

upvote

by accrual13 hours ago|

[-]

Here is additional discussion and hacks around trying to retain Thinking output in Claude Code (prior to this release):

https://github.com/anthropics/claude-code/issues/8477

reply

upvote

by puppystench15 hours ago|

[-]

Does this mean Claude no longer outputs the full raw reasoning, only summaries? At one point, exposing the LLM's full CoT was considered a core safety tenet.

reply

upvote

by MarkMarine13 hours ago|

[-]

Anthropic was chirping about Chinese model companies distilling Claude with the thinking traces, and then the thinking traces started to disappear. Looks like the output product and our understanding has been negatively affected but that pales in comparison with protecting the IP of the model I guess.

reply

upvote

by andai9 hours ago|

[-]

When Gemini Pro came out, I found the thinking traces to be extremely valuable. Ironically, I found them much more readable than the final output. They were a structured, logical breakdown of the problem. The final output was a big blob of prose. They removed the traces a few weeks later.

reply

upvote

by axpy9068 hours ago|

[-]

That’s kind of funny since a Chinese model started the thinking chains being visible in Claude and OA in the first place.

reply

upvote

by fasterthanlime14 hours ago|

[-]

I don't think it ever has. For a very long time now, the reasoning of Claude has been summarized by Haiku. You can tell because a lot of the times it fails, saying, "I don't see any thought needing to be summarised."

reply

upvote

by fmbb14 hours ago|

[-]

Maybe there was no thinking.

reply

upvote

by astrange12 hours ago|

[-]

It also gets confused if the entire prompt is in a text file attachment.

And the summarizer shows the safety classifier's thinking for a second before the model thinking, so every question starts off with "thinking about the ethics of this request".

reply

upvote

by einrealist13 hours ago|

[-]

They are trying to optimize the circus trick that 'reasoning' is. The economics still do not favor a viable business at these valuations or levels of cost subsidization. The amount of compute required to make 'reasoning' work or to have these incremental improvements is increasingly obfuscated in light of the IPO.

reply

upvote

by DrammBA14 hours ago|

[-]

Anthropic always summarizes the reasoning output to prevent some distillation attacks

reply

upvote

by jdiff13 hours ago|

[-]

Genuine question, why have you chosen to phrase this scraping and distillation as an attack? I'm imagining you're doing it because that's how Anthropic prefers to frame it, but isn't scraping and distillation, with some minor shuffling of semantics, exactly what Anthropic and co did to obtain their own position? And would it be valid to interpret that as an attack as well?

reply

upvote

by DrammBA12 hours ago|

[-]

> I'm imagining you're doing it because that's how Anthropic prefers to frame it

Correct.

> would it be valid to interpret that as an attack as well?

Yup.

reply

upvote

by irthomasthomas13 hours ago|

[-]

If you ask claude in chinese it thinks its deepseek.

reply

upvote

by typ5 hours ago|

[-]

I don't think that learning from textbooks to take an exam and learning from the answers of another student taking the exam are the same.

Joking aside, I also don't believe that maximum access to raw Internet data and its quantity is why some models are doing better than Google. It seems that these SoTA models gain more power from synthetic data and how they discard garbage.

reply

upvote

by fragmede10 hours ago|

[-]

Firehosing Anthropic to exfiltrate their model seems materially different than Anthropic downloading all of the Internet to create the model in the first place to me. But maybe that's just me?

reply

upvote

by jdiff8 hours ago|

[-]

I don't see the material difference in firehosing anthropic vs anthropic firehosing random sites on the internet. As someone who runs a few of those random sites, I've had to take actions that increase my costs (and burn my time) to mitigate a new host of scrapers constantly firing at every available endpoint, even ones specifically marked as off limits.

reply

upvote

by robrenaud9 hours ago|

[-]

Yeah, it's different. Anthropic profits when it delivers tokens. Hosting providers pay when Anthropic scrapes them.

reply

upvote

by 59nadir7 hours ago|

[-]

Yes, what the LLM providers did was worse and impacted people financially a whole lot more in lost compensation for works as well as operational costs that would never reach the heights they did solely because of scrapers on behalf of model providers.

reply

upvote

by vintermann13 hours ago|

[-]

Attacks? That's a choice of words.

reply

upvote

by DrammBA13 hours ago|

[-]

Definitely Anthropic playing the victim after distilling the whole internet.

reply

upvote

by butlike12 hours ago|

[-]

Proprietary pattern matcher proves there's no moat; promptly pre-covers other's perception.

reply

upvote

by nyc_data_geek114 hours ago|

[-]

Very cool that these companies can scrape basically all extant human knowledge, utterly disregard IP/copyright/etc, and they cry foul when the tables turn.

reply

upvote

by butlike12 hours ago|

[-]

All extant human knowledge SO FAR. Remember, by the nature of the beast, the companies will always be operating in hindsight with outdated human knowledge.

reply

upvote

by stavros13 hours ago|

[-]

Yep, that is exactly what happens. It's a disgrace that their models aren't open, after training on everything humanity has preserved.

They should at least release the weights of their old/deprecated models, but no, that would be losing money.

reply

upvote

by copperx10 hours ago|

[-]

We should treat LLM somewhat like patents or drugs. After 5 years or so, the models should become open source. Or at very least the weights. To compensate for the distilling of human knowledge.

reply

upvote

by MasterScrat14 hours ago|

[-]

and so does OpenAI

reply

upvote

by blazespin13 hours ago|

[-]

Safety versus Distillation, guess we see what's more important.

reply

upvote

by andrepd14 hours ago|

[-]

CoT is basically bullshit, entirely confabulated and not related to any "thought process"...

reply

upvote

by clbrmbr7 hours ago|

[-]

But still CoT distillation WORKS. See the DeepSeek R1 paper.

reply

upvote

by whattheheckheck5 hours ago|

[-]

Tokens relate to each other. More tokens more compute

reply

upvote

by p_stuart8215 hours ago|

[-]

yeah they took "i pick the budget" and turned it into "trust us".

reply

upvote

by bandrami14 hours ago|

[-]

I keep saying even if there's not current malfeasance, the incentives being set up where the model ultimately determines the token use which determines the model provider's revenue will absolutely overcome any safeguards or good intentions given long enough.

reply

upvote

by vessenes12 hours ago|

[-]

This might be true, but right now everybody is like "please let me spend more by making you think longer." The datacenter incentives from Anthropic this month are "please don't melt our GPUs anymore" though.

reply

upvote

by 13 hours ago|

[-]

deleted

reply

upvote

by lukan15 hours ago|

[-]

"Also notable: 4.7 now defaults to NOT including a human-readable reasoning token summary in the output, you have to add "display": "summarized" to get that"

I did not follow all of this, but wasn't there something about, that those reasoning tokens did not represent internal reasoning, but rather a rough approximation that can be rather misleading, what the model actual does?

reply

upvote

by motoboi15 hours ago|

[-]

The reasoning is the secret sauce. They don't output that. But to let you have some feedback about what is going on, they pass this reasoning through another model that generates a human friendly summary (that actively destroys the signal, which could be copied by competition).

reply

upvote

by XenophileJKO15 hours ago|

[-]

Don't or can't.

My assumption is the model no longer actually thinks in tokens, but in internal tensors. This is advantageous because it doesn't have to collapse the decision and can simultaneously propogate many concepts per context position.

reply

upvote

by ainch14 hours ago|

[-]

I would expect to see a significant wall clock improvement if that was the case - Meta's Coconut paper was ~3x faster than tokenspace chain-of-thought because latents contain a lot more information than individual tokens.

Separately, I think Anthropic are probably the least likely of the big 3 to release a model that uses latent-space reasoning, because it's a clear step down in the ability to audit CoT. There has even been some discussion that they accidentally "exposed" the Mythos CoT to RL [0] - I don't see how you would apply a reward function to latent space reasoning tokens.

[0]: https://www.lesswrong.com/posts/K8FxfK9GmJfiAhgcT/anthropic-...

reply

upvote

by clbrmbr7 hours ago|

[-]

There’s also a paper [0] from many well known researchers that serves as a kind of informal agreement not to make the CoT unmonitorable via RL or neuralese. I also don’t think Anthropic researchers would break this “contract”.

[0] https://arxiv.org/abs/2507.11473

reply

upvote

by haellsigh14 hours ago|

[-]

If that's true, then we're following the timeline of https://ai-2027.com/

reply

upvote

by magicalist11 hours ago|

[-]

> If that's true, then we're following the timeline

Literally just a citation of Meta's Coconut paper[1].

Notice the 2027 folk's contribution to the prediction is that this will have been implemented by "thousands of Agent-2 automated researchers...making major algorithmic advances".

So, considering that the discussion of latent space reasoning dates back to 2022[2] through CoT unfaithfulness, looped transformers, using diffusion for refining latent space thoughts, etc, etc, all published before ai 2027, it seems like to be "following the timeline of ai-2027" we'd actually need to verify that not only was this happening, but that it was implemented by major algorithmic advances made by thousands of automated researchers, otherwise they don't seem to have made a contribution here.

[1] https://ai-2027.com/#:~:text=Figure%20from%20Hao%20et%20al.%...

[2] https://arxiv.org/html/2412.06769v3#S2

reply

upvote

by butlike12 hours ago|

[-]

Hilariously, I clicked back a bunch and got a client side error. We have a long way to go. I wouldn't worry about it.

reply

upvote

by matltc14 hours ago|

[-]

Care to expound on that? Maybe a reference to the relevant section?

reply

upvote

by ACCount3714 hours ago|

[-]

Ctrl-F "neuralese" on that page.

reply

upvote

by 999114 hours ago|

[-]

You should just read the thing, whether or not you believe it, to have an informed opinion on the ongoing debate.

reply

upvote

by matltc4 hours ago|

[-]

I did read it a while back. Was curious what parent was referring to specifically

reply

upvote

by 999114 hours ago|

[-]

That's not supposed to happen til 2027. Ruh roh.

reply

upvote

by literalAardvark13 hours ago|

[-]

Only if you ignore context and just ctrl-f in the timeline.

What are you, Haiku?

But yeah, in many ways we're at least a year ahead on that timeline.

reply

upvote

by JoshuaDavid12 hours ago|

[-]

Don't.

The first 500 or so tokens are raw thinking output, then the summarizer kicks in for longer thinking traces. Sometimes longer thinking traces leak through, or the summarizer model (i.e. Claude Haiku) refuses to summarize them and includes a direct quote of the passage which it won't summarize. Summarizer prompt can be viewed [here](https://xcancel.com/lilyofashwood/status/2027812323910353105...), among other places.

reply

upvote

by WhitneyLand14 hours ago|

[-]

No, there is research in that direction and it shows some promise but that’s not what’s happening here.

reply

upvote

by XenophileJKO13 hours ago|

[-]

Are you sure? It would be great to get official/semi-official validation that thinking is or is not resolved to a token embedding value in the context.

reply

upvote

by astrange12 hours ago|

[-]

You can read the model cards. Claude thinks in regular text, but the summarizer is to hide its tool use and other things (web searches, coding).

reply

upvote

by 14 hours ago|

[-]

deleted

reply

upvote

by alex7o14 hours ago|

[-]

Most likely, would be cool yes see a open source Nivel use diffusion for thinking.

reply

upvote

by motoboi14 hours ago|

[-]

Don't. thinking right now is just text. Chain of though, but just regular tokens and text being output by the model.

reply

upvote

by boomskats15 hours ago|

[-]

'Hey Claude, these tokens are utter unrelated bollocks, but obviously we still want to charge the user for them regardless. Please construct a plausible explanation as to why we should still be able to do that.'

reply

upvote

by dheera14 hours ago|

[-]

Although it's more likely they are protecting secret sauce in this case, I'm wondering if there is an alternate explanation that LLMs reason better when NOT trying to reason with natural language output tokens but rather implement reasoning further upstream in the transformer.

reply

upvote

by simonw13 hours ago|

[-]

... here's the pelican, I think Qwen3.6-35B-A3B running locally did a better job! https://simonwillison.net/2026/Apr/16/qwen-beats-opus/

reply

upvote

by bredren13 hours ago|

[-]

A secret backup test to the pelican? This is as noteworthy as 4.7 dropping.

reply

upvote

by qingcharles12 hours ago|

[-]

That flamingo is hilarious. Is that his beak or a huge joint he's smoking?

reply

upvote

by SyneRyder11 hours ago|

[-]

With the sunglasses, the long flamingo neck and the "joint", I immediately thought of the poster for Fear And Loathing In Las Vegas:

https://www.imdb.com/title/tt0120669/mediaviewer/rm264790937...

EDIT: Actually, it must be a beak. If you zoom in, only one eye is visible and it's facing to the left. The sunglasses are actually on sideways!

reply

upvote

by cakeface12 hours ago|

[-]

You used a secret backup test! Truly honored to see the flamingos. We obviously need them all now ;-)

reply

upvote

by djmips4 hours ago|

[-]

Opus did get the feet on pedals better.

reply

upvote

by ionwake10 hours ago|

[-]

based sun worshipping pelican

reply

upvote

by XCSme6 hours ago|

[-]

The reasoning modes are really weird with 4.7

In my tests, asking for "none" reasoning resulted in higher costs than asking for "medium" reasoning...

Also, "medium" reasoning only had 1/10 of the reasoning tokens 4.6 used to have.

reply

upvote

by nisarg24 hours ago|

[-]

Insane! Even Haiku doesn't make such mistakes.

reply

upvote

by XCSme6 hours ago|

[-]

Oh, and also, the "none" and "medium" variants performed the same (??)

reply

upvote

by markrogersjr13 hours ago|

[-]

CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 claude…

reply

upvote

by miguno11 hours ago|

[-]

As per https://code.claude.com/docs/en/model-config#adaptive-reason...:

> Opus 4.7 always uses adaptive reasoning. The fixed thinking budget mode and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING do not apply to it.

reply

upvote

by slekker13 hours ago|

[-]

What does that actually do? Force the "effort" to be static to what I set?

reply

upvote

by 15 hours ago|

[-]

deleted

reply

upvote

by 12 hours ago|

[-]

deleted

reply

upvote

by devmor3 hours ago|

[-]

> Also notable: 4.7 now defaults to NOT including a human-readable reasoning token summary in the output, you have to add "display": "summarized" to get that

That’s extremely bothersome because half of what helps teams build better guardrails and guidelines for agents is the ability to do deep analysis on session transcripts.

I guess we shouldn’t be surprised these vendors want to do everything they can to force users to rely explicitly on their offerings.

reply

upvote

by dgb2315 hours ago|

[-]

Don't look at "thinking" tokens. LLMs sometimes produce thinking tokens that are only vaguely related to the task if at all, then do the correct thing anyways.

reply

upvote

by gck113 hours ago|

[-]

Why does this comment appear every time someone complains about CoT becoming more and more inaccessible with Claude?

I have entire processes built on top of summaries of CoT. They provide tremendous value and no, I don't care if "model still did the correct thing". Thinking blocks show me if model is confused, they show me what alternative paths existed.

Besides, "correct thing" has a lot of meanings and decision by the model may be correct relative to the context it's in but completely wrong relative to what I intended.

The proof that thinking tokens are indeed useful is that anthropic tries to hide them. If they were useless, why would they even try all of this?

Starting to feel PsyOp'd here.

reply

upvote

by dgb2313 hours ago|

[-]

Didn't you notice that the stream is not coherent or noisy? Sometimes it goes from thought A to thought B then action C, but A was entirely unnecessary noise that had nothing to do with B and C. I also sometimes had signals in the thinking output that were red flags, or as you said it got confused, but then it didn't matter at all. Now I just never look at the thinking tokens anymore, because I got bamboozled too often.

Perhaps when you summarize it, then you might miss some of these or you're doing things differently otherwise.

reply

upvote

by gck112 hours ago|

[-]

The usefulness of thinking tokens in my case might come down to the conditions I have claude working in.

I primarily use claude for Rust, with what I call a masochistic lint config. Compiler and lint errors almost always trigger extended thinking when adaptive thinking is on, and that's where these tokens become a goldmine. They reveal whether the model actually considered the right way to fix the issue. Sometimes it recognizes that ownership needs to be refactored. Sometimes it identifies that the real problem lives in a crate that's for some reason is "out of scope" even though its right there in the workspace, and then concludes with something like "the pragmatic fix is to just duplicate it here for now."

So yes, the resulting code works, and by some definition the model did the correct thing. But to me, "correct" doesn't just mean working, it means maintainable. And on that question, the thinking tokens are almost never wrong or useless. Claude gets things done, but it's extremely "lazy".

reply

upvote

by gck19 hours ago|

[-]

Also, for anyone using opus with claude code, they again, "broke" the thinking summaries even if you had "showThinkingSummaries": true in your settings.json [1]

You have to pass `--thinking-display summarized` flag explicitly.

[1] https://github.com/anthropics/claude-code/issues/49268

reply

upvote

by quadruple13 hours ago|

[-]

I agree. Ever since the release of R1, it's like every single American AI company has realized that they actually do not want to show CoT, and then separately that they cannot actually run CoT models profitably. Ever since then, we've seen everyone implement a very bad dynamic-reasoning system that makes you feel like an ass for even daring to ask the model for more than 12 tokens of thought.

reply

upvote

by shawnz14 hours ago|

[-]

Thinking summaries might not be useful for revealing the model's actual intentions, but I find that they can be helpful in signalling to me when I have left certain things underspecified in the prompt, so that I can stop and clarify.

reply

upvote

by thepasch15 hours ago|

[-]

They also sometimes flag stuff in their reasoning and then think themselves out of mentioning it in the response, when it would actually have been a very welcome flag.

reply

upvote

by vorticalbox14 hours ago|

[-]

Yea I’ve seen this and stopped it and asked it about it.

Sometimes they notice bugs or issues and just completely ignore it.

reply

upvote

by Gracana14 hours ago|

[-]

This can result in some funny interactions. I don't know if Claude will say anything, but I've had some models act "surprised" when I commented on something in their thinking, or even deny saying anything about it until I insisted that I can see their reasoning output.

reply

upvote

by ceejayoz14 hours ago|

[-]

Supposedly (https://www.reddit.com/r/ClaudeAI/comments/1seune4/claude_ch...) they can't even see their own reasoning afterwards.

reply

upvote

by astrange12 hours ago|

[-]

It depends on the version. For the more recent Claudes they've been keeping it.

reply

upvote

by dataviz100013 hours ago|

[-]

Thinking helps the models arrive at the correct answer with more consistency. However, they get the reward at the end of a cycle. Turns out, without huge constraints during training thinking, the series of thinking tokens, is gibberish to humans.

I wonder if they decided that the gibberish is better and the thinking is interesting for humans to watch but overall not very useful.

reply

upvote

by dgb2313 hours ago|

[-]

OK so you're saying the gibberish is a feature and not a bug so to speak? So the thinking output can be understood as coughing and mumbling noises that help the model get into the right paths?

reply

upvote

by dataviz100012 hours ago|

[-]

Here is a 3blue1brown short about the relationship between words in a 3 dimensional vector space. [0] In order to show this conceptually to a human it requires reducing the dimensions from 10,000 or 20,000 to 3.

In order to get the thinking to be human understandable the researchers will reward not just the correct answer at the end during training but also seed at the beginning with structured thinking token chains and reward the format of the thinking output.

The thinking tokens do just a handful of things: verification, backtracking, scratchpad or state management (like you doing multiplication on a paper instead of in your mind), decomposition (break into smaller parts which is most of what I see thinking output do), and criticize itself.

An example would be a math problem that was solved by an Italian and another by a German which might cause those geographic areas to be associated with the solution in the 20,000 dimensions. So if it gets more accurate answers in training by mentioning them it will be in the gibberish unless they have been trained to have much more sensical (like the 3 dimensions) human readable output instead.

It has been observed, sometimes, a model will write perfectly normal looking English sentences that secretly contain hidden codes for itself in the way the words are spaced or chosen.

[0] https://www.youtube.com/shorts/FJtFZwbvkI4

reply

upvote

by johnmaguire5 hours ago|

[-]

> It has been observed, sometimes, a model will write perfectly normal looking English sentences that secretly contain hidden codes for itself in the way the words are spaced or chosen.

This sounds very interesting, do you have any references?

reply

upvote

by 10 hours ago|

[-]

deleted

reply

upvote

by alienbaby9 hours ago|

[-]

no, he's saying that in amongst whatever else is there, you can often see how you could refine your prompt to guide it better in the firtst place, helping it to avoid bad thinking threads to begin with.

reply

upvote

by sharms8 hours ago|

[-]

This is because the "thinking" you see is a summary by a highly quantized model - not the actual model, to mask these tokens

reply

upvote

by maximgran12 hours ago|

[-]

https://github.com/anthropics/claude-agent-sdk-python/pull/8... - created PR for that cause hit it in their python sdk

reply

upvote

by nextaccountic12 hours ago|

[-]

If you do include reasoning tokens you pay more, right?

reply

upvote

by schneehertz1 hours ago|

[-]

In fact, you need to pay regardless of whether the output includes reasoning tokens or not

reply

upvote

by j456 hours ago|

[-]

Prompts seem to need to evolve with every new model.

reply

upvote

by Razengan7 hours ago|

[-]

Claude Opus 4.6 has been hilarious for me so far: https://i.imgur.com/jYawPDY.png

reply

upvote

by mosfets7 hours ago|

[-]

Made my day!

reply

upvote

by 13 hours ago|

[-]

deleted

reply

upvote

by haellsigh15 hours ago|

[-]

I got Opus 4.7 working on oh-my-pi with this commit if it interests you: https://github.com/azais-corentin/oh-my-pi/commit/6a74456f0b...

reply

upvote

by cyanydeez14 hours ago|

[-]

It's likely hiding the model downgrade path they require to meet sustainable revenue. Should be interesting if they can enshittify slowly enough to avoid the ablative loss of customers! Good luck all VCs!

reply

upvote

by vessenes14 hours ago|

[-]

They have super sustainable revenue. They are deadly supply constrained on compute, and have a really difficult balancing act over the next year or two in which they have to trade off spending that limited compute on model training so that they can stay ahead, while leaving enough of it available for customers that they can keep growing number of customers.

reply

upvote

by dainiusse14 hours ago|

[-]

But do they? When was the last time they declined your subscription because they have no compute?

reply

upvote

by mrandish11 hours ago|

[-]

> When was the last time they declined your subscription because they have no compute?

Is that a serious question? There have been a bunch of obvious signs in recent weeks they are significantly compute constrained and current revenue isn't adequate ranging from myriad reports of model regression ('Claude is getting dumber/slower') to today's announcement which first claims 4.7 the same price as 4.6 but later discloses "the same input can map to more tokens—roughly 1.0–1.35× depending on the content type. Second, Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings. This improves its reliability on hard problems, but it does mean it produces more output tokens" and "we’ve raised the default effort level to xhigh for all plans" and disclosing that all images are now processed at higher resolution which uses a lot more tokens.

In addition to the changes in performance, usage and consumption costs users can see, people say they are 'optimizing' opaque under-the-hood parameters as well. Hell, I'm still just a light user of their free web chat (Sonnet 4.6) and even that started getting noticeably slower/dumber a few weeks ago. Over months of casual use I ran into their free tier limits exactly twice. In the past week I've hit them every day, despite being especially light-use days. Two days ago the free web chat was overloaded for a couple hours ("Claude is unavailable now. Try again later"). Yesterday, I hit the free limit after literally five questions, two were revising an 8 line JS script and and three were on current news.

reply

upvote

by alwa13 hours ago|

[-]

Most weekdays.

https://status.claude.com/

reply

upvote

by vessenes13 hours ago|

[-]

Just last week. They cut off openclaw. And they added a price increased fast mode. And they announced today new features that are not included with max subscriptions.

They are short 5GW roughly and scrambling to add it.

reply

upvote

by dainiusse12 hours ago|

[-]

Now. Is it price increase or resource shortage. These are not the same thing.

reply

upvote

by vessenes12 hours ago|

[-]

If there is any elasticity to demand whatsoever, then these are the same thing.

reply

upvote

by cyanydeez13 hours ago|

[-]

IT's cute you think they're gonna do any full training of a model. As soon as they can extract cash from the machine, the better.

reply

upvote

by vessenes12 hours ago|

[-]

This is low effort thinking, and a low effort comment. They have a lot of cash. They do not think they have achieved a "city of geniuses" in a datacenter yet. They are racing against two high quality frontier model teams, with meta in the wings. They have billions of dollars in cash that they are currently trying to spend to increase their datacenter capacity.

Any compute time spent on inference is necessarily taken from training compute time, causing them long term strategic worries.

What part of that do you think leads toward cash extraction?

reply

upvote

by vdalhambra3 hours ago|

[-]

[flagged]

reply

upvote

by zhonghuajin4 hours ago|

[-]

[flagged]

reply

upvote

by boxingdog9 hours ago|

[-]

[dead]

reply