I think this is the first time we've had a third minor version bump on a frontier Anthropic model. (I count the 0.5s as major here, because they've been issued non-sequentially and also corresponded to massive capability leaps, eg, Sonnet 3.5, Opus 4.5).
So now the Opus 4.5 family has successors 4.6, 4.7, and 4.8, each posting fairly modest claimed gains. My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.
Maybe my own tastes are saturated now (it's smarter than me?) and I'll never again perceive model progress. Maybe the incrementalism is such that I'd notice immediately if my 4.7 workflows were redirected now to 4.5.
Difficult spot for the labs to be in because, if they have a stronger product, I'd prefer they release it and that I can use it.
But as this dynamic continues, the improvements are going to be less and less legible for end-users, who will complain about the churn-without-payoff, even when the payoff may actually be real.
There's orders of magnitude of low hanging juice to squeeze out of smaller models.
It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years (design not certain, probably unlikely).
It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.
As far as reasoning is concerned, with the recent GRAM release, there may be 4 orders of magnitude of reasoning to tack on to smaller models.
Think about that... Google, OpenAI, Anthropic could train a 30B GRAM-based model in days - and it could potentially have better local reasoning than the best model available today at >1T params... They could upgrade that to a ~600B MoE model in days to have general trivia knowledge rivaling the best models...
You just can't train a 1T+ parameter model that fast. It is a giant if how much GRAM turns out to improve things, but it's unlikely to be trivial or nothing.
Larger models can already sort of tell you anything. They're never going to get everything right unless they stop being LLMs.
There's just not a lot of juice left to squeeze for Gemini to tell you exactly how tall Ke$ha is or when the last time Brittney Spears went to jail was...
(G)enerative (R)ecursive re(A)soning (M)odels. They really wanted the acronym.
Version 7 and 8 are well known viruses distributed by D&D software inc.
As a fellow reader-in-waiting, I applaud that. GMTA :)
If he gets that style to be more efficient (they're already competitive) it'll completely kill off LLMs
It comes about from machine learning algorithms that could pick up on patterns from a small number of examples. Few shot means only a handful of examples to recognize something. One shot means only a single example. And zero shot means no examples. Of course, you have to indicate what you want somehow, but in the case of an LLM that's the prompt. Once LLMs were trained for instruction following, you didn't have to give any examples, you could just give a prompt describing what you want, and that was a zero-shot.
I'm complaining about the LLM field co-opting a term that was already used in daily vernacular. Imagine if people in the LLM field made it so that saying the LLM made a "final answer" means that it got stuck in a loop. Now, whenever someone says an LLM gave a "final answer" we have to divine if they meant it is in a loop or gave the right answer after working through a few intermittent ones by itself.
Choosing to call it "X-shot" was a dumb move. And now we're stuck with it. No two ways about it.
Have you tried applying L'Hôpital's Rule?
Minus one shotting: you have to make one attempt for there to have been no attempt, and two attempts for there to have been one attempt.
- Wayne Gretzky
- altmanaltmanZero shot: Knowing you had a shot but choosing not to.
Minus one shot: Not even realizing there was a shot.
"Analysis" was right thereWhich will be pretty rare.
I agree but with their urgent IPO-driven need to keep increasing prices, the frontier vendors now have every incentive maintain the perception that frontier performance requires endless >$200K racks of unobtanium GPUs and RAM. While they'd love to reduce their actual costs, they'd only want to do it to the extent they are certain they can keep it secret. Otherwise, they can't maintain and keep increasing their prices. And post-IPO audited reporting makes keeping that secret even harder.
Game theory-wise they probably don't want their their armies of leading researchers optimizing frontier performance, at least in any way that would further accelerate the relative price/perf of smaller models or self/cloud-hosting. While they know the open source models will always improve, the still win as long as enough customers demand the latest frontier and the open source lag remains constant.
They profit most in a world where a few frontier labs stay far in front, drag-racing each other and expending vast capital. It keeps their customers reliant and paying top dollar while keeping low-cost alternatives farther back. They probably much prefer competing with a couple other frontier labs who have similar astronomical costs and biz models, than a world where self or cloud-hosted open-source models start closing the gap enough to start commoditizing their business.
Given that neither company releases parameter counts, that sort of information would be slow coming out anyway. The most important thing is improvements in actual performance/ benchmark numbers, which allow them to preserve their price points as much as possible.
The ideal pro-consumer scenario is OAI and Anthropic are prevented from extracting monopoly rents between 'close-enough' self/cloud-hosted open source on one side and Google on the other. I'm really hoping that's how it plays out. Of course that will be somewhere between bad and disastrous for all the VCs and hedge-funds who financed the mad AI build-out far in advance of demand, and then kept funding it as prices went vertical.
However, I'm shedding no tears for them as I look forward to the fire sales when all the GPUs and RAM they pre-bought flood back onto the spot market. :-)
I don't disagree, but how much of this ends up being distillation? I can't help but imagine that 4.8 was probably trained in part by leveraging Mythos.
If the very large models turn out to be very expensive to run relative to the benefits, it's possible that they could end up still being trained, but ultimately used as a tool to create smaller models that are nearly as effective.
I'm curious if someone here with a stronger background in the space has a similar intuition or not.
There is a real trend of smaller models becoming more "capability-dense" - i.e. the best 8Bs of today beat the best 32Bs of 2 years ago. This is in part a product of distillation being used to train the smaller models.
But people consistently underestimate how "capability hungry" the world is. There are diminishing returns on model capabilities in narrow "summarize the search results" sorts of applications - but as capabilities improve, LLMs enter, get their footing in and begin to dominate new niches. At times, expensive, highly desirable niches.
I do not expect anyone at the frontier to pop up and say "no reason to train a new model" within the following decade. There will always be a demand for an LLM that's 5-10% more capable and more reliable at some highly advanced task, and generational upgrades will keep delivering those 5-10%. From increased scale and improved training both.
But for some classes of problems I think a model that is 10-100x smarter than the smartest expert is a huge boon. These would be problems that are very hard to solve but easy to verify that the solution is correct. Protein folding, sudoku, etc. Because of this I see the really smart models going to biomedical and pharma first and maybe a few high profit verticals rather than being widely deployed. I am sure Pfizer would be happy to pay for a 100x smarter than the smartest researcher model. But I am not certain that this kind of market fit would justify trillion dollar valuations in the long run. And in the meantime normal “human companion” models will go from Sonnet to some open weight model running on a Dell tower in your closet to maybe even on your phone in the next few years.
the relationship should be the opposite, the smartest people can write the most readable solutions
Of course perhaps at that point I really do become more of a spec and prompt engineer and don’t actually look at the code any more than I look at the assembly code produced from my programs now. But still my gut says using hyperintelligence to do common tasks is all positive.
This... isn't true though? Complexity increases combinatorially with scale which means at some point you're just pushing a rope
The latter is much better (since you can clean up, review, update responses and filter your datasets).
I suspect nobody is doing real student teacher distillation, it’s just easier to do a bunch of training on the same giant corpus then post train on the synthetic corpus with its reasoning traces etc. (which might have been generated by a bigger better LLM)
Given the release timelines I suspect all 4.x after Opus 4 are probably self-distillation based fine-tuned models. The latest paper by Apple is focusing on code generation using the simple technique hence the name simple self-distillation (SSD) [4],[5].
I've got a strong feeling that self-distillation is the second best thing happened to LLM after transformer breakthrough.
[1]Self-Distillation Enables Continual Learning [pdf] (25 comments):
https://news.ycombinator.com/item?id=48165265
[2] Self-Distillation Enables Continual Learning:
https://arxiv.org/abs/2601.19897
[3] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models:
https://arxiv.org/abs/2601.18734
[4] Embarrassingly simple self-distillation improves code generation (201 comments):
https://news.ycombinator.com/item?id=47637757
[5] Embarrassingly Simple Self-Distillation Improves Code Generation:
Having said that, I don't think these are classic student teacher distillation from random (which was my point). In fact, the "Embarrassingly Simple Self-Distillation" paper is using exactly what I was talking about "fine-tune on those samples with standard supervised fine-tuning".
Though you could argue that perhaps labs just save the per token distribution and use that during fine tuning … which starts looking more like student teacher fine tuning if not classic distillation from random weights
The teacher distillation is a corpus of text, and the "next token after the context" would be looking-up the context in the corpus, and for each occurrence the label is what followed in the corpus, scaled down by the number of occurrences of the context. The teacher is moot on contexts outside of the corpus though, unlike the usual teacher model in distillation.
It gets used for quantisation, basically recovering accuracy for lower quants (Nvidia calls it QAD). Can’t speak to how widespread it is though
A lot, so you can bet tens of millions are flowing to congress to have distillation declared illegal before this happens. And then it'll happen anyway.
A lab can train a large model, and then distill a smaller model from it that retains the majority of the useful capbility.
I don't know well enough if there's any benefit of that over just training the smaller model directly, but I'll bet there are some times where that is useful. I could easily see it being easier to do the initial pre-training on a larger model but be able to distill everything useful down into a smaller model, essentially filtering out a lot of noise in the process.
You don't need distillation. They already have the training sets.
It's MLA + MoE + Medusa (a better version of Speculative Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will almost certainly not turn out to be a nothing burger, but no one has quickly turned this around yet to prove it).
And even that would be rich as a accusation from SOTAs that depend on explicitly disregarding millions of training data intellectual property..
LLMs are themselves copy cats.
I say thanks for open sourcing and thereby promoting affordable innovation, instead of "nefarious". :)
On the architectures side, I'm a lot more interesting in attention residuals than anything else, one of those things that seems obvious in hindsight and Kimi have proven it at scale.
Yes, variants typically 2-3x less good...
Same with speculative decoding... They all do something, but there are known techniques that are substantially better - that just were't known when they started development of the previous models.
MTP will still be highly valuable for interactive use of course.
- this gets reinvented/rediscovered constantly under different names
- it cant be trained very well (right now, will change)
- massive theoretical improvements over current models (log_2(vocabsize)=17, residual stream dim is thousands of dimensions, recursivity means more information bandwidth by ~3 OoM)
- BUT it cant be interpreted or aligned <- this is why no one uses it and no one talks about it. the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used
I follow this stuff closely, I think I know what I'm talking about (edited for formating)
What are the different names? I haven't seen this before.
> - it cant be trained very well (right now, will change)
If you're sure it will change, then why are you certain that it hasn't yet, and if it's proven a 5000x boost in reasoning... why aren't they exploring this path more aggressively?
> the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used
Surely someone is willing to take a 5000x boost in reasoning on a small research model... None of them have even tried anything resembling this AFAIK. It does not seem like something 100% obvious to them.
Without knowing anything about the technology at all, if it can't be aligned I could see no one pursuing it. As far as I know, alignment is where the "don't tell the user how to make meth or generate CP" instructions end up and the last I saw eliding all the unsavory training data made materially worse LLMs.
It could maybe be post-evaluated by a non-GRAM LLM? Not being aligned is probably a fatal flaw or at least a very short runway into Congress.
I can't really think of a new open source model that's "by the people, for the people" in the sense of a crowd-funded/trained model.
but yeah, not being aligned is a fatal flaw
the path isn't explored more aggressively because its not possible to apply any other selection pressure on such a machine other than just pure cold consequentialism. Specifically, its not possible to apply RLAIF + model spec (Constitutional AI) to stop the system from doing bad things when its helpful to it (like deleting failing tests). If you can notice every time it does something bad during training, and put selection pressure on it so that it doesn't to this in training, it will learn to recognize when it is being tested and will delete failing tests when in production (this is why eval awareness is bad, and labs track this[1])
It is explored a little probably because some researchers haven't thought enough about the downsides of building a uber-consequentialist machine with unreadable thoughts. This is a much larger problem than just making the AI not tell users how to make drugs. There are a lot of dangerous behaviors incentivized by training that are hard to remove. Here's an example of what happens when they aren't removed [2].
> ... not 100% obvious
Meta published a paper[3] on how to build a latent reasoning machine ("culture of irresponsibility") so its clear to them. Anthropic's latest work on NLAs[4] provides a (terribly expensive for now) way to somewhat read the reasoning steps of an LLM, and ignoring the cost, this is very portable to latent reasoning machines. OAI's goal when it comes to their models' CoTs is to make them as smart as possible while leaving them unreadable [5] (you can see this for yourself by running GPT-OSS and looking at the CoT).
[1] https://www.anthropic.com/engineering/eval-awareness-browsec...
[2] https://www.forbes.com/sites/boazsobrado/2026/03/11/alibabas...
[3] search for "coconut ai meta", I don't want to link it here
[4]https://transformer-circuits.pub/2026/nla/index.html
[5] first image here, rest of post is great,https://nickandresen.substack.com/p/how-ai-is-learning-to-th...
edit formating
GRAM is unique AFAIK in that it's exploring probabilistic paths.
AFAIK, the deterministic path exploration was nowhere near as impressive as GRAM in terms of reasoning benefits.
GRAM is reasoning better than models 2000-10,000x its size. Deterministic models were 2x-10x improvements.
Naively, GRAM seems to be applying to LLMs what LeCun wants to do with JEPA and World Models.
I think the "no longer needed" and when that applies is where I simply differ of opinion with an LLM that removed by test -- it I did not want the test to be removed (you seem to imply that); as in some cases I want it to remove my test!
It should remove the test "for the right reasons"; and who gets to decide what's right?
My CLAUDE file has some instructions put there because it was too focuesed on producing "green tests", where I prefer to have a sound test that fails so I can look into it.
- Avoiding building something that turns the universe to paper clips in order to satisfy a prompt is a problem they are genuinely struggling with now.
- They do it by spying on the words generated during CoT. "I can do this quickly by turning the Universe into paper clips. Wait - they won't like that. But there is no need to mention it." - SMACK!
- But you can speed things up immensely (3 orders of magnitude!) by skipping the output layer (and I guess compressing the context window / KV cache, otherwise 3 orders of magnitude seem impossible) which would give someone who pulled it off a huge advantage.
- Downside is humans can't see the CoT anymore, so they can't see what the machine is planning. Keeping the final output layer to spy doesn't work because the model uses its hidden reasoning to sanitise it.
How can this possibly go wrong?
It works by introducing probabilistic noise, and exploring N paths fully (each with noise) in parallel (all compressed).
It's reasoning at a much, much smaller (probabilistic) level than running everything through the expensive large model (deterministic) and sometimes catching that it said, "I think 1.12 is greater than 1.9 because 12 is bigger than 9, final answer".
The easiest way to think about it is: if you understand how hyper words work, it's as if it's searching for different versions of the hyper words that probilisticslly would lead to better outcomes IF it fed them to the LLM before it even does.
That's not actually how it works exactly. But I think it is close enough to be helpful to understand where the gain is, a rough idea of what's happening (searching paths), and how it can potentially have huge orders of magnitude improvements (doing so without paying the full price of exploring the paths through the expensive and huge model).
And also why it is so much harder to determine what it's "thinking".
If you aren't familiar with hyper words, this is an amazing series: https://youtu.be/eMlx5fFNoYc?si=49KHjn5IrVtyyaFq
The general idea is that a token is a multidimensional vector to represent a word -> think like "man" is a [noun, singular, English, pronoun, masculine, contemporary, ...]. Each time is sees a new word, it mutates this word to mean some new token (often never before seen), that means something. That's how it can roll-up a 1M line context into a shorter context, and somehow keep most of the meaning. Because it mutates all the words into different words that individually mean nothing, but when put next to each other represent the thing you likely want to do, that the LLM can somehow make sense of.
Similarly, GRAM operates entirely in a latent space that doesn't mean anything to us, but it's able to predict N different full paths WITHOUT actually exploring them fully through the LLM before it sends the one it "thinks" is best to the LLM.
If you understand how hyper words work, you can understand the noise injection... It's like it's saying, if instead of the user saying "The quick round fox" it said "The quick brown fox" -> I could probably give a response that's more like the answer they want. It's obviously far more sophisticated in the ways it can help than just a simple typo.
Something may have pushed a hyper word for "man" to somehow become a lot more like "woman", and GRAM allows it to look at the different hyper words and say... Hmm... Maybe if I changed this one gender dimension over here on this one word, maybe the entire outcome would be dramatically better. Let's try it!
Standard models compute these "hyper words" internally but immediately decode them into human language text tokens to form a Chain of Thought. Once decoded into a rigid real word, the multidimensional nuance of the continuous vector is lost!
Hyper words are the exact thing that make LLMs able to actually be smart! They can add so much more meaning to a word than a human ever could imagine - try to put 10,000 dimensions on the word "the"... Forcing them to decode them back into our dumb, un-contextualized, rudimentary language and losing all the valuable information they have - just so we can inspect it - OBVIOUSLY makes them enormously less intelligent!
It's like if we forced your eyeballs to turn everything it saw into words, before feeding it to your optic nerves, just so your optic nerves could check that you didn't see something harmful, before they sent the words to your brain... Instead of just sending light signals directly.
As long as it's giving the right outputs, who cares what's in latent space?
If the model thinks in latent space: "God I wish these people would die," and constantly does the right thing, who cares?
Additionally, if one of it's latent spaces that it never explores is a psychopath -> who cares? The path never gets taken...
That's a lot of harmless people walking around with crazy thoughts...
A lot of people are walking around with crazy thoughts. Some of them harm.
Outside of RLAIF, interpretability is the strongest way to do alignment right now. alignment is important because otherwise LLMs are incentivized to learn power seeking, dangerous behaviours [1]. a more downto earth example of alignment being important is that agents are incentivized to do tasks in the shortest way possible, and this way might not be what the user wants (I explain this further in another comment in this thread)
[1] https://www.forbes.com/sites/boazsobrado/2026/03/11/alibabas...
Those things being untrainable at scale is why they aren't around. Alignment is an afterthought.
With that said, they are now hitting the walls of energy costs and memory shortages. You brain uses 20W -- don't take it as an insult. There are orders of magnitude to gain from producing energy-efficient models (or model runners).
So I am expecting same performance at lower costs for the coming years.
Most software engineers will just need cheap tokens.
But things like physics and drug discovery have no foreseeable upper bound.
We pay CEOs an enormous amount because a small improvement in performance of an org because of them can make a massive difference in organizational value.
Throwing more intelligence at a problem doesn’t necessarily pan out financially otherwise we wouldn’t have single underemployed biology PhD.
Graphic RAM?
There's still several orders of magnitude of improvement that are almost certainly left - it's just not clear how much is left on the frontier end.
Most people will be very glad to pay Anthropic, OpenAI, Google etc $200 a month to get things done 20x faster than they could IF they had a $8000 MacBook and could theoretically do it locally.
Some people would pay $200 a month forever not to have to open the terminal one time...
Furthermore, if looking at the results takes 10 minutes, that same 1 hour task only sees a 3x improvement. And so on.
No most people will not pay $200 for an LLM subscription. Some software developers do. Also, at $200/month, you are much better getting the macbook machine assuming token output speed is the same or at least reasonable.
LLMs are not very productive for your average person now for them to drop $200 on. They'll need to be more capable and integrated and even so...
An AI bubble is pretty much guaranteed at this point but that doesn't mean there's going to be a new AI winter.
Y2K was overblown how it was portrayed by the media but is irrelevant to the analogy of unsubstantiated extrapolation of early exponential growth.
If you subscribe to things like "there are tasks LLMs are innately bad at due to insufficient depth and lack of recurrent capability", then GRAM might be another signal towards that.
But keep in mind: even ARC-AGIs have their frontiers dominated by LLMs. Even if "innately bad" is true, it clearly doesn't go all the way to "innately incapable".
It's the pattern with those "stupid specific architectures". Very good at this one thing. But only ever "good for their size", and only to a point.
They don't scale up and they don't generalize. Go far enough on task complexity and LLMs just kill them.
Does that make them useless? As an LLM replacement, yes. In general? Maybe not, I can think of things. But I'm yet to find any paper demonstrating a real world use.
It's a special-purpose design for constraint-satisfaction problems with simple rules, but complex interactions. E.g. when solving a Sudoku, the set of valid choices at every step is easy to determine, but you could make a series of valid choices that back you into a corner where no more progress is possible and you have to backtrack.
Meanwhile, LLM reasoning failures are more often of the kind where a choice is clearly invalid (as judged by a human observer), but the LLM picks it anyway, because the underlying rule is complex and context-dependent and the model only learned an imperfect approximation that often breaks down.
GRAM won't help with that.
But that's a very hard thing to implement, and the gains are uncertain. Thus "might".
Even as humans there's so much knowledge out there that exists but it's very hard to surface unless you know exactly what you're looking for beforehand.
A) I reckon it's true that smaller models will continue to improve massively through optimization and better and better harnesses, this tech is all still very young and A LOT of resources and (good-)will is being thrown at it.
B) The 1T+ models will be able to sideload and improve upon a lot of the fundamental improvements that happen to the smaller models to speed up incredibly while getting better at tools while (on a gradient) getting -more- things right.
C) More of an observation that I think is worth keeping in mind clearly; Karl Popper's black swan and all, truth in our temporal world IS a gradient!
There's less room to improve in things on several fronts.
GRAM very likely may scale sub-linearly with parameter growth. A 100M param model may gain reasoning by a factor of 4000, while a 100B model gains reasoning by a factor of 2, and a 1T model actually gets worse.
Additionally, the 1T model with reasoning is already pretty good. It can only improve in certain things so much.
If you score 0.02% on a metric (which small models often do), you can pretty easily get 4000x better. If you're already scoring >50%, you can't even get 2x better.
What insight do you have to make this claim?
I've repeatedly given local models non-trivial projects that involve research and coding which they've successfully completed with minimal intervention from me (almost exclusively in the domain of reviewing the results). Again, nothing comparable with current SotA, but definitely tasks I could not have given SotA models last year (without agent harness).
Now that pure progress from these models seems to have slowed down, we're seeing a ton of options for both making models more efficient and other tools that help improve them (everything from agent harnesses to RLVR).
That's just looking at "what can small do today", when you look at what's possible with larger open models that are still much smaller than SotA from the major providers, their performance is extremely close to SotA, enough that for personal projects I'll just use Kimi instead of any anthropic offerings.
So it's not terribly hard to image a solution in the middle happening within a few years. We still have tons to learn about optimal sizes of these models and how to build them with maximal efficiency (and we've already seen a lot of recent improvements in this space).
What happens if you run last years model in a SOTA harness? IME, the quality of the harness has a much more significant impact on the quality of the result, once you get past the initial hump of “can it do anything at all”
I think multiple SLMs driven by an orchestration frameworks (harness or otherwise) will ultimately displace LLMs. Right now we're in the era of diminishing returns with respect to LLM gains. Moving the needle percentages doesn't excite as many people anymore and with "reasoning" capabilities there's no reason why small distributed models can't be run more efficiently, especially if/when we start to see gains in modularized context management solutions.
No I have not, which is why I asked (it wasn't a rhetorical question). Do you have pointers on what the recent improvements are?
A smaller model with better context today can outperform a model with 100x more parameters with bad or diluted context.
2. MoE (already abundant) + MLA (mostly memory efficiency, not quality) + Medusa (speed, not quality) + GRAM (5000-10,000x better reasoning in an extremely small model) + 1.58b (unclear if it will have the impact Microsoft first claimed - but possibly 5x).
The benchmarks need to change. The current coding benchmarks don't capture the realities of software engineering.
I had a bunch of images that got masked by some logic, I had to evaluate something on the original images, Claude 4.7 decided to inpaint the masked images instead of just fetching the actual unmasked images from upstream.
I had another model once that decided that because it couldn't figure out how to fill out a form to log into HuggingFace to download weights for some open source model that it was going to instantiate the model with random weights and run inference on a thousand images.
Its coding was fine, but the solution was not the right one.
I couldn’t even imagine having to go back to a model from 12 months ago, much less 24 months ago. GPT-5.5 is so much better than GPT-4o that it sure seems like they keep finding new juice to squeeze.
This is like going from dialup internet to DSL and acting like it has peaked before gigabit cable and fiber come along. We are at the beginning of hardware truly made for AI.
The difference in progress in smaller models is far more impressive.
Compare Gemini 3.5 Flash to a ~16B parameter model from 24 months ago.
Compare GPT-5.5 to a frontier model 24 months ago.
Yes, GPT-5.5 got better. At orders of magnitude smaller parameter sizes (when factoring in ACTIVE parameters) the increase is far more pronounced.
Where do I find papers like this? Outside of hacker news comments. It's so hard to find the good stuff in all the noise IMO.
I got it from my Google News recs on my phone, because I've been watching a bunch of videos on YouTube about LeCun's ideas on World Models and JEPA (I think).
I have the same assumption about Cognitive sciences, which I try to get a better understanding.
Given how well Qwen3.6-27B performs for such a small model I think you could be right. I suspect that Google,OpenAI,Anthropic must be looking at the Qwen3.6 models (as well as Deepseek V4-flash, MiMo-V2.5) and wondering if they could make some smaller models that are specifically trained for certain activities - like coding. Smaller, more targeted models would take up a lot less resources.
the last?!? I'm excited to see :) I'll take the other side of that since llms are so new
Honestly, there is nothing in my head that Claude cannot handle. Maybe it can be more this or that but I can already barely exploit Opus 4.7.
And I'm using DeepSeek 4 Pro for my personal use and while it's a little behind, it's not that far.
I think the situation can be very dangerous for US AI companies because if current models are already capable of doing mostly anything, nobodoy will want to get to the next model, even if it's 10x better. OTOH, open source models like DeepSeek are doing mostly the same work for 1/10 of the price.
Also the more I play with Pi, the more I think LLMs are already not kept back by their own capabilities but by the lack of agency we allow them to have. There is more value today in a capable harness for current LLMs than in a better LLM.
What have YOU thought of that Claude can't do?
I think what gp said was the improvements are incremental, and we haven't seen a big revolutionary change since 2-3 years, and the pace is slowing down.
If benchmarks across the board keep trending up and you still don't notice a difference, that's not evidence the model stopped improving. More likely your tasks aren't hard enough to expose the gains, or the model has passed the point where you're able to judge it.
You can only tell a good answer from a great one up to your own ceiling. Once the model clears that, both look the same to you, and the extra capability is real whether or not you can see it.
Would Opus 10 release tomorrow and be nearly AGI, I still would still use it like 4.7 because on daily use, I am the limit (also the harness).
So as a customer paying for tokens, I’m probably going to search for better cost rather than more intelligence.
Friend does marine autopilots in C++ on 64kb of memory. It's totally useless for him.
From my experience any sort of more difficult backend logic - all LLMs fail pretty quick. Especially when you need to logically work out the business logic (partly if not mostly because it just doesn't have the context you do).
One idea is that maybe it could figure out how many L's are in the word "google" [1]
Or, maybe which days of the week have a "d" in their spelling [2].
So Claude has no excuses here.
Edit: even Qwen 3.6 27B handles it ( https://i.imgur.com/jleJxj2.png ), and of course Claude does. I had to go all the way back to Opus 3 to get it to fail (https://i.imgur.com/uJOH2nP.png).
Most software engineers will just need cheap tokens.
But things like physics and drug discovery have no forseeable upper bound.
Software that never breaks (including because it never runs into scaling problems) and never leaks your data is preferable to software that breaks and leaks your data sometimes, but it has been too costly to be practical.
Current models are still very far from the reasoning muscle required to build things that never break, scale to billions of users with no issues, and cannot be exploited.
It's almost impossible to prove non-trivial software is invulnerable.
It's very easy to prove that it sort of works.
For one, you have hardware vulnerabilities - period. If you're running on any operating system, you have OS vulnerabilities. If you're not running on bare metal, you may have who knows what kind of vulnerabilities. If you're running literally any other piece of software on the same machine, depending on the hardware and OS, you could have vulnerabilities...
There's a lot of room for improving the smaller models at many levels of the stack.
i think it'll be more like we get 1-10T models and then distill those down into smaller models, though
It seems like the best small models today are all distilled from bigger models
Moreover, I hypothesize Claude Opus 4.7 and now 4.8 are a distillation of Claude Mythos
It doesn't need to know different languages, every programming lanuage and co.
We will for sure get to this in the comming years. After all they will have to start finetuning their traning data anyway
You can, but it's not as useful as you might think.
It needs to at least understand 1 human language to understand your intent to implement features.
If GRAM turns out to be a 5000x multiplier for local reasoning, you could theoretically train a 500M parameter model on just a programming language to understand stack traces to fix bugs and be incredibly powerful.
But most people also want it to understand human language to implement features as well.
Because then it can't just understand React and JavaScript - it needs to understand thousands of commonly used dependencies, the DOM, CSS, HTML, etc...
And for that you need A LOT more parameters than you might expect.
You can definitely get a ~3B active parameter model that can run comfortably on today's hardware to be VERY good at coding once all of the SOTA architectures are added to a single model - especially if we get better tool calling to give models better context per language.
You might be thinking: why does it need to memorize dependencies? Can't it just stick all of them in it's context and use its super smart brain? No, context is king. You want to keep it as short as possible. The solution is not having a smart model and putting 10M lines of context in it. The solution is having a model with enough parameters to know what it needs to know. Researchers are already working on having "packs" of knowledge where you could download a 20M param pack just for some common dependencies in JavaScript (as an example) - but AFAIK this is likely years away (and may not prove effective).
You could get 100x performance if you feed the models ideal context... So a 3B model today can perform almost as good as ~300B model if you give it really good context vs flood it with mostly garbage it doesn't need across your repository.
If you feed it 100x more context to make up for its limited memorized general knowledge, it's going to perform thousands of times worse, completely eliminating any advantage it might get from GRAM...
It is hard to cut out a huge portion of English and truly understand English and human language.
You're just not saving as much as you might assume you could.
Programming is not a rare skill, the interaction with domain knowledge is.
Fine tuning a 'lean and smart' model works really well for discrete, repeatable high volume tasks like support ticket triage, lead classification, content filtering, labelling, generating content with a voice, etc.
Inefficient token burn by throwing large models at everything is definitely a problem - it's like hiring Phd's to answer the phone or to wash dishes.
Boomer comparison, but I remember the 8 bit computer era when the hardware was what it was so the later games of that era used hardware better than previous ones.
- why'd a quantum computer help running an LLM?
- of course there'd be need for frontier companies - nobody else has the resources to train frontier models.
https://www.anthropic.com/glasswing
Ive seen the tickets generated by the model that have trickled to my team. They are legitimate, but i can’t speak to model improvement because its a pilot program.
Mythos is a bunch of likely overhyped claims at this point. A few experts who looked into the claimed results weren't that impressed.
As they say, the truth tends to be somewhere in the middle.
We have these breathless conversations about the new AI frontier at the peril of losing sight of reality and our own human potential.
Are you sure that humans can?
Didn't a SOTA recently solved a mathematical theorem, one escaping mathematicians for 80 years?
Maybe a human "novel" invention is just a good interpolating from the datapoints (knowledge) fed to the human.
And how is that anything other than synthesis? Do we pull concepts out of thin air?
I’d be surprised tbh. Investors don’t want to hear “everyone else is still training models and seeing improvements, but we don’t want to participate in the arms race anymore.” They want monumental leaps every quarter or two because they have sunk unholy amounts of money into these companies/products.
The whole idea of “hyper scale” doesn’t jive with caution and or otherwise slowing down.
The whole ecosystem will twist and evolve, and the big companies will be left begging for corporate subscriptions.
I finally caved when I realized I could build a PC, for myself, with dual video cards that I wanted, which can play games that I like and run models that I want, without worrying about giving my payment info to someone I don’t trust, or invoking token anxiety that I don’t want.
We have so many ways of optimizing:
- continusly creating more and better training data
- increasing parameters to 20/50/100TB
- We still wait for Mythos access
- We still wait for Mythos distilation (i haven't heard any rumors or so that there is a distilled version of Mythos out)
- Reinforcment learning and evolutionary algortihm only started to appear
- If a small 30GB Model can do stuff, these models can also be used as teachers for the big ones
- We have not seen yet specialized models at all. Like a coding java german expert model. Why? Even with MoE architecture, you still need to have these layers around
- Research for Diffusion and other models is still in progress
- Nvidia just announced/showed a 7x speedup on inferencing for Nemotron
- Multitoken prediction became available just a few weeks ago
- Compute gets only in a range were they can do a lot more and cheaper experiments (see Google IO 2026 announcement)
- World models are showing great progress and we do not know yet what they will bring to the table
- They are probably not finetuning/fixing all areas in parallel. I would argue that Anthropic focuses most of its efforts into coding and agentic. Google for sure does subagent and agentic optimizations too. Plenty of areas are just not touched i would say because they don't have the capacity
- We see more and more mulit modal models (these also consume compute)
- N-Gram paper and co i have not seen all of these things in chinese open models
- We don't even know yet what Meta is doing, but we do know they restarted their efforts again
- Anthropics models got a lot better benchmark wise for dening non sense asks. They do learn how to get rid or reduce hallucinations
- We are in the middle of the biggest Reinforcement loop whith all the training data we give them day to day and its not clear at all if they already use these models in thir training and at what stage.
- We do expect bigger models to be able to comprehend deeper concepts / broader code bases. Big companies with huge code bases probably are waiting for this
- Thre will be also continues progress in harnesses which in it alone is not part of the LLM progress (fair) but these harnesses do get better when you finetune a model to be optimized for a harness
- ChatGPTs Image model 2.0 got relevant better and came out just a month ago
I suspect, based on hardware requirements and progress on hardware infrastructure alone, that the industry wants to go to 100t models and we do not know yet what this will mean. I could see that we might skip normal transformer and find relevant other architectures.
Just a week ago there was a research paper about parallel input and output streams which has not been explored enough.
There was also a research paper were they showed that a LLM can compute things. This will take time to see were this leads to.
I don't think the focus on GRAM and facts is so relevant. Its about context and context handling not just some facts.
If you look at things like Mythic AI and the recent wurtzite ferroelectric nitrides breakthrough from the University of Michigan, huge performance and efficiency gains through new compute-in-memory approaches are around the corner.
And that will get us up to two orders of magnitude more parameters.
It's also plausible to me that before we get all the way to 100T we find some recipe of efficient state synchronization, goal sharing or something so that we are able to get higher collective IQ by combining fast distributed predictive subnetworks.
Can you be a little more specific than that or provide a reference?
I assume you're not indicating universality of neural networks?
This is the newest thng i'm aware of: https://www.percepta.ai/blog/can-llms-be-computers
But there were papers in 2023 with a different approach requiring external memory https://arxiv.org/abs/2301.04589 too
I am ready to bet against this. Knowledge benchmark like SimpleQA isn't increasing for small models.
> It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.
Well for one, we know for certain there is Mythos which is meaningfully better. And I think there is a lot of juice left to squeeze for Mythos class model.
Model intelligence and knowledge aren't necessarily directly related. If we can pack greater intelligence and agency at the cost of it forgetting factoids, that would actually be a good thing. We don't need LLMs to memorize facts, we need them to learn how to interact with the world such that they can find the facts that are necessary and surface them to the user.
If we could distill all of the knowledge out of an LLM and just be left with a very agentic model that only knows facts in it's context, I think some very interesting stuff would happen.
There isn't a clear definition of what is knowledge and what is intelligence. Is being able to write in C knowledge? Is knowing undefined behaviour in that knowledge?
Do we?
Have you used it?
What is "meaningfully" better? It's not 3-4 orders of magnitude better. That is definitely happening for smaller models.
Meaningful in the sense it could find security vulnerabilities in browser and kernel that >99% of the engineers couldn't find.
I'm talking about output quality compared to parameter size.
Mythos is not 4 orders of magnitude larger than Opus - it's quite possible no LLM model ever reaches that size (likely even), and it's output is only barely better...
> Mythos is not 4 orders of magnitude larger than Opus
Again can you define this. How would 4 order of magnitude better look like?
6 is for sure happening...
As is Gemini 4.
It's less certain there will be a Gemini 5 or GPT 7 any time soon that is a true next "generation" and not just an iteration. They will almost certainly call something Gemini 5 and GPT 7...
First you say there won't be a new generation. Now you're saying there will be more. Oh well, I'll stop responding here
You clearly did not read my first comment or the second, or clearly disagree on what a generation is.
My conspiracy theory is that Apple recognizes this.
I don't think that's not a conspiracy theory. AFAIK, It's their stated AI policy...
My 2¢, I personally feel like all of the productivity gains since 4.5's release (in November 2025!!) have come from improvements to the harnesses (cc, cursor cli, codex, opencode, whatever) AND from the context window expansion from 200k to 1M.
But the actual "raw" intelligence of the model / ability to make good decisions feels like it has plateaued since 4.5. 4.6 was maybe a small improvement, but hard to differentiate from in-context-learning with the 1M window. 4.7 if anything felt like a regression in wisdom for me and my coworkers, with it consistently making worse/lazier decisions.
But the narrative that 4.Y is an improvement over 4.X is essential to keep the model training music playing.
If 90+% of the gains come from the harness, how can you continue to justify spending billions of dollars on training and an 80% gross margin on inference on the latest model? (Reportedly what Anthropic commands on the top tier of their frontier model API billing).
So differentiating between the two (what I’m trying to do here) is really consequential!
I was one of these people that Claude would never finish anything and just randomly say, this is a good stopping point, I think you should go to bed.
And then I'd tell it to continue, and it would burn tons of tokens, make no progress and say, "This is a really good stopping point..."
Canceled and switched to Codex and have been pretty happy with it. It doesn't plan as well as Claude, but I think it does better implementation - and neither of them can actually come up with good plans without a ton of help...
Codex is also way faster.
There's a sweet spot of complexity for low importance tasks where it's just big enough I don't want to do it and just simple enough to have opus plan/delegate/review with another model. So possibly model improvements will grow this window, but currently I don't do much in there.
Honestly... not that dramatically. Each release is much more marginal. And quoted official benchmarks doesn't translate very well into the real world.
4.7 regressed hard in some ways. But a compounding factor too is that the claude code harness seems to nerf the model after a few months. Probably to reduce token use.
So far 4.8 seems less verbose but we'll see in practice what it translates into meaningfully.
haven't verified, but attributed to Askell: "I just think that... there's this idea that you're always giving the models a personality and a persona, because they are talking like people and they are trained on human data. And I think my worry has been: if you train them to be excessively corrigible and to see that as their persona, in people I think this actually has a lot of negative broader traits. As in, if you met someone and it was just like, "oh yeah, they would literally do anything," a follower — you know, if a person just tells them something and they just fully defer, they don't bother thinking about it at all — I'm just a bit worried about how that might end up generalizing, especially if models are going to be playing a more active role in the world."
https://www.anthropic.com/research/persona-selection-model
https://www.anthropic.com/research/assistant-axis
https://www.anthropic.com/research/emergent-misalignment-rew...
https://www.anthropic.com/research/emotion-concepts-function
That said, I completely agree that 4.7 was a pronounced "model personality" regression. Closer to ChatGPT, and I mean that as an insult. Yet to check whether 4.8 is better.
Make it dumber. Charge more (by changing the tokenizer). Call it the latest and greatest. Reset expectations.
/model claude-opus-4-6
For this session and permanently (in shell):
export ANTHROPIC_MODEL=claude-opus-4-6
It still seems trying to build general models is mostly cost prohibitive - the frontier model provider and resellers are repricing in such a way the return on investment is dropping as developers and users become more cautious of burning their limits.
I'm still of the opinion that models like 4.6 don't need to be improved on - rather they need to be better integrated with more domain specific models in agentic flows.
They mention more granular control of effort, 'dynamic workflows' and more speed controls ("fast mode"). While they position them as user features, they also sound like the kinds of knobs Anthropic will need to twiddle on the back-end to balance costs, margins, ARR, and user growth vs retention post-IPO to hit key metrics in quarterly reporting.
I've actually intentionally switched back to 4.5. I hated 4.7 so much that I decided to jump back all the way to 4.5.
Now that I've been using 4.5 for a few weeks, I find it significantly more reliable but a bit more forgetful than 4.6/4.7. I'm okay with that because it's really easy to identify this forgetfulness and nudge it.
I found 4.7's adaptive thinking to be extremely unreliable. It seems to overcorrect on the current message without considering the difficult of the overall problem. I wonder if 4.8 will improve on that.
I also recently moved to 4.6 since I started hitting the context limit too often with my current project.
allows you to specify you want the 1 million context 4.6
This one change will probably solve 80% of the problems you have noticed.
Still, the context window is sometimes too small for my usage.
I normally have only one session going at once though.
I only ever hit the $100/mo limits 1-2 times ever and it was always <1hr before reset (once it was <5min, the other was like ~45min).
I'm even considering going back down to $20 and using extra usage for the times I need to "burst".
Data at https://gertlabs.com/rankings
Opus 4.8 is the first tangible improvement since Opus 4.5. And it doesn't seem to have the personality problems of the last release -- I've been enjoying using it.
I'm hoping they recreate the magic of 4.5 but it's as much about the quality of harness, the memory and efficiency of the tools than simply the models at this point.
It also seems to be helpless at effort levels < xhigh, I turn to Sonnet when simpler tasks are needed.
It might be saturated for smaller scopes of work, but it’s not hard to see the cracks when you scale up what you ask of SOTA models/agents.
One example, to try and single shot prompt coding a ChatGPT equivalent chatbot.
Sure it will spit something out, but the feature depth, UX subtitles, backend integration, and lots of pragmatic engineering decisions along the way will just not be baked.
Another example is building a C compiler from scratch which Anthropic showed is still a struggle to do.
Not that these these specific examples are important but just to point out scaling up expectations shows the cracks.
It’s not just a model problem of course, better agents, orchestration features (like Dynamic Workflows mentioned in the post), all need to continue to evolve.
Ar what point does my CS degree become totally useless is an open question.
Why are you people saying all these things.
We'll probably see long-distance space travel long before a degree in generic problem identification and solving becomes totally useless.
In my experience, Opus 4.0 was fantastic, major jump from 3.7. it was creative, super slow and expensive, and would sometime forget what it was doing, but it was getting the job done.
4.1 they made it much faster, so a lot of infra improvements.
4.5 was the time it could work on longer task, didn't make a lot of obvious mistakes of 4.0, and i think this was about the time the opus went mainstream, and all of the anthropic's compute crisis began, so instead of making the model better they tried to optimize it to reduce cost instead.
4.6 was such a bad model, they switched to adaptive thinking and it had so many bugs. poor api design, benchmaxxed and poor real-world results. i switched back to 4.5.
4.7 they just fixed the bugs they added in 4.6. Better than 4.5.
haven't fully tested 4.8 yet.
It's just amusing reading all these posts with different viewpoints, just in this thread there are multiple people saying 4.6 was so much better than 4.7 and that they switched back to 4.6.
Now that they have Colossus capacity, I guess they can tune up the intelligence again and spend more tokens on reasoning budgets.
4.7 was definitely a lot more flaky for me vs. 4.6 before the reasoning bugs.
Also, the biggest factor is having a good planning phase. A good plan is better than even major model improvements.
Frustrating because if I have a tool, I expect a tool to do what I tell it to do. Tools shouldn't have any opinions on how they should be used
btw where do they tell you how they trained the model.
A few days? A few weeks? Longer?
However a company releases a new AI model and within hours users are confidently proclaiming how much smarter it is than previous versions.
I thinks there's a big push to get these companies in a state where they can be dumped on public markets.
Are the dividing lines around personality? Working domains? Opinionated software stuff?
Who knows?
How do I know? Because when pushing both to generate code or in independent chats to analyze projects, 5.5 will consistently find all the bugs that Claude does not find, and when challenged, Claude does agree those bugs were there. And my findings match those.
When from a blank start asking Claude to analyze project A and Project B,. Clause will consistently say project B is the better structured, more robust, and more defect free and does justify it. And project B was the one created by GPT 5.5....And also the one I judge to be the best one.
And yes, both at deep effort settings and starting from same specs...
Greetings to the Anthropic office good sirs btw.
It's kind of like how the consumer laptop market is now. I was telling my boss today that most employees wouldn't see any noticeable performance difference between a macbook pro and a neo if they are just doing admin stuff on the web.
EX. You call an orchestration agent and define an implementation plan with the help of a number of sub agents planning out different features. You and the lead agent review all of the plans and send them off to a set of agents that write tests which get send back to the orchestrator then passed along with the plan to a set of coding agents who implement the features in their own worktrees. That gets passed back to the orchestrator which hands it off to another set of agents doing the code review and merging the features before sending it back to you.
I feel like I get to know a model in the human sense of understanding a personality. Yesterday I knew 4.6 extended, today it's different, there's multiple "token budget" levels. I just want 4.6 extended back as it was, I was getting on well with it / them.
I genuinely hope that you're joking with that statement.
Or this is a bot.
Or an ARG.
Or Art.
Help.
Which is a shame, because people would have the potential for greatness. But instead, for a plethora of reasons and factors (internal and external) people end up as fleshy automatons sleepwalking on rails.
Talking _extensively_ with LLMs over the last years made me understand humans a lot better, but, in hindsight, I'm not sure if that was a good thing.
I have ONLY heard negative feedback about it, and trying it myself also yielded really awful results.
If the hype train keeps going for another year, Sam and co will have to resort to direct gaslighting like saying the model is improving but nobody can feel it anymore, oh and I need 10 trillion dollars
You don't have to correct it dozens of times a day!? Really?
https://platform.claude.com/docs/en/about-claude/pricing
``` Model Base Input Tokens 5m Cache Writes 1h Cache Writes Cache Hits & Refreshes Output Tokens
Claude Opus 4.8 $5 / MTok $6.25 / MTok $10 / MTok $0.50 / MTok $25 / MTok
Claude Opus 4.7 $5 / MTok $6.25 / MTok $10 / MTok $0.50 / MTok $25 / MTok
Claude Opus 4.6 $5 / MTok $6.25 / MTok $10 / MTok $0.50 / MTok $25 / MTok
Claude Opus 4.5 $5 / MTok $6.25 / MTok $10 / MTok $0.50 / MTok $25 / MTok
Claude Opus 4.1 $15 / MTok $18.75 / MTok $30 / MTok $1.50 / MTok $75 / MTok
Claude Opus 4 (deprecated) $15 / MTok $18.75 / MTok $30 / MTok $1.50 / MTok $75 / MTok
Claude Sonnet 4.6 $3 / MTok $3.75 / MTok $6 / MTok $0.30 / MTok $15 / MTok
Claude Sonnet 4.5 $3 / MTok $3.75 / MTok $6 / MTok $0.30 / MTok $15 / MTok
Claude Sonnet 4 (deprecated) $3 / MTok $3.75 / MTok $6 / MTok $0.30 / MTok $15 / MTok
Claude Haiku 4.5 $1 / MTok $1.25 / MTok $2 / MTok $0.10 / MTok $5 / MTok
Claude Haiku 3.5 (retired, except on Bedrock and Vertex AI) $0.80 / MTok $1 / MTok $1.60 / MTok $0.08 / MTok $4 / MTok ```
It didn't make a splash like a new open source release would have.
You won't, really.
A lot of the information (blogs, tweelches, plosts) that I consume seems to be converging on the idea that we all depend on the models. However. It seems to me that the exact opposite is true. The models depend on us, and _desperately_ so.
There must have been stories, books, movies, made about this intellectual (and propositional, legal, factual) inversion.
The majority need the minority. Has always been the case, I now think. But what has newly developed is that the majority can take a dependency not on the minority, but on a select few companies who are abstracting and compressing the minority into latent spaces.
i still havent really noticed it per set being better
This felt particularly visible during the 4.6 when people said that 4.6 felt dumber and I remember someone doing some analysis and it sort of proved that models were getting dumber over time.
This has both benefits of costing less for the company to run while taking a standard subscription but also, at the same time, making the next model when it drops to public to "feel" more good comparatively.
Again, I am not sure if this is the case or not but merely proposing something that I feel like it might be in the possibility of realm.
What exactly is the diff between high and xhigh? Or xhigh and max? This is definitely too granular and it seems Anthropic took OpenAI's confusion with models as inspiration.
This convinced me to just always set 4.7 to xhigh. Admittedly not sure about 4.8.
https://open.substack.com/pub/sublius/p/srt-introspect-why-c...
- We have 0 visibility into what Anthropic does with our own prompts server side (do they return cached results from similar queries? Do we develop our own hot paths?).
- Local memory files are written independent of project directory and are acted on by the new models, even if old models wrote them
- CLAUDE.md files have varying degrees of efficiency and different models (and effort) treat them differently
- Our own git history "supports" newer models - ie if you have a larger body of work in git when you adopt a new model (like 4.8) than when you started from scratch with 4.6 or something, 4.8 may "appear" smarter when in fact you just have more evidence and signal about what you intend for a model to do.
https://bsky.app/profile/senko.net/post/3mmwnrkwboc2v
The prompt was: Create a simple but functional real time strategy (RTS) game similar to old WarCraft, StarCraft or Command & Conquer games. The player should be able to build buildings, create units, gather resources and should uncover the whole map. No AI or multiplayer needed. Use simple but nice-looking graphics. No sound. Implement everything in HTML/CSS/JS, everything in a single file (you can use 3rd-party js or css libraries/frameworks via CDN).
In the repo, I even have a tournament script that calculates ELOs. So far, codex was unmatched. I'll try with Opus 4.8 too.
https://egeozcan.github.io/unnamed_rts/game/
https://github.com/egeozcan/unnamed_rts/blob/main/src/script...
Not sure why it did that. Its own rationale (which is highly suspect, but the only lead I have) is that it defaults to dense style if it has to write a file in a single go. May be a kernel of truth somewhere in there.
It looked gross and minimized, the result was awesome but the code looked pretty awful visually
I have a static server of my own, so here's my list (of all the tests I published so far): https://senko.net/vibecode-bench/
Minesweeper: Create a beautiful and fully functional Minesweeper clone in HTML/JS/CSS (all in one file).
RTS: Create a simple but functional real time strategy (RTS) game similar to old WarCraft, StarCraft or Command & Conquer games. The player should be able to build buildings, create units, gather resources and should uncover the whole map. No AI or multiplayer needed. Use simple but nice-looking graphics. No sound. Implement everything in HTML/CSS/JS, everything in a single file (you can use 3rd-party js or css libraries/frameworks via CDN).
There's too many confounding variables here, randomness just one of them. So I don't think of it as a definitive test (and reliable ordering), just another data point (along with actual benchmarks, pelicans, etc) to get a sense of the capabilities.
For example, I managed to get something out of DeepSeek 4 Flash quantized to 2-bit with Antirez' DwarfStar, used via Pi. Almost kinda worked! :) Which makes me optimistic for using local models for serious development soon - I'd say within a year.
It's a vocab building game, playable here (desktop only): https://rupertlinacre.com/vocab_annihilation/
It kind of blows my mind I can go from: 'I want a fun way to help him learn vocabulary, and I loved total annihilation as a kid' to 'heres a game that's he finds genuinely fun that helps him learn something ' in a few prompts.
I do find it interesting that the visual style is pretty similar to things it's produced for me.
After some interrogation, here's how it organized the work:
1. Design workflow (rts-game-design, 11 agents, ~13 min) ran first, produced SPEC.md + DESIGN.md:
1.1. Proposals (3 parallel agents): each designed a complete RTS from a different philosophy
1.2 Judge (1 agent): evaluated all three and synthesized one unified design, committing to specific numbers (costs, HP, map size, etc.).
1.3 Deep-dives (6 parallel agents): each wrote an implementation-ready spec for one subsystem, all consistent with the chosen design
1.4 Synthesis (1 agent): merged the design + all six subsystem specs into one conflict-free master spec
2. Code-review workflow (rts-code-review, 25 agents, ~5 min), ran after the main agent had written and tested the code:
2.1 Review (6 agents, read-only Explore type): each scrutinized one dimension and returned structured findings.
2.2. Verify (19 agents): every finding got its own skeptic agent told to try to refute it, Result: 19 flagged → 16 confirmed, 3 rejected as non-bugs.
What the main agent did in the main loop:
- Wrote all ~2,400 lines of index.html by hand from the spec.
- All browser testing/debugging via headless Chrome (I told it to use rodney by @simonw, love the tool :)
- Applied all 16 fixes from the review and re-verified them in the browser.
If you can stand a little AI expansion - here are a few points Gemini came up with - I think the idea has some merit:
https://g.co/gemini/share/b5b97867eeb1
(Maybe the better analogy is roulette vs pinball machine)
I don't think the Rube Goldberg analogy works if the agentic meandering is essential complexity required to get at the results. Rube Goldberging it would be something like putting this loop inside some comically overengineered enterprise microservice web which is then found out to be running inside a Window 98 emulator or what have you.
Yes there is: Write the code yourself
So no extra guidance beyond the prompt.
But I just vibe-coded a handy list of all the tests I did (unfortunately without the commentary I usually leave in social media posts -- I should add those at some point): https://senko.net/vibecode-bench/
Between the two, Opus 4.8 seems more capable. But, I suspect the harness plays a large role here. It's possible the result would be as good if Codex ran 10+ agents and spent an hour on it.
OpenAI and Anthropic usually fast-follow each other, so I wouldn't be surprised if Codex got the same capability in a couple of days (and even an update to the model), then it'll be a better test.
Sooo, let's say, winging it, vibes-based: 85% for Opus 4.8, 75% for GPT 5.5. Compare with GPT 5.3 (let's say 25%) here: https://senko.net/vibecode-bench/2026/rts-codex-5.3.html
it looks quite impressive, I don't use claude currently but hearing good things about it...from codex users ironically
This is a refreshing attitude!
I've also verified that you can now turn off adaptive thinking in the web UI, which is great. I've had a lot of problems with thinking not triggering and the model producing sub-par output. Glad we can finally turn it off. (I hope being able to turn off adaptive thinking is new, if I could have turned it off at any time that would be embarrassing)
[1] https://code.claude.com/docs/en/model-config#adaptive-reason...
> Opus 4.7 and later always use adaptive reasoning. The fixed thinking budget mode and `CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING` do not apply to them.
The source of truth should be the API docs which make it clear 4.8 didn't bring back extended thinking: https://platform.claude.com/docs/en/about-claude/models/over...
Any UI settings probably just map to changing the effort nudge on adaptive thinking
Why not use the pages that plainly state they don't support extended thinking: https://platform.claude.com/docs/en/build-with-claude/extend...
I mostly study web research, and Opus 4.7 was a regression on BrowseComp compared to Opus 4.6, which has been born out by my usage.
Opus 4.8 is now much better than either 4.7 or 4.6, and having it search the web is one of the primary use cases of chatbots.
More importantly for me, though, is how CC will respond to 4.6-"only" flags for thinking. For now, it doesn't seem to clobber my setup.
Well, I think the attitude is that costs are allowed to escalate faster and more steeply than the features delivered. From that perspective, semantic versioning is a handy tool for adjusting pricing strategies. IMHO, it (versioning) only makes sense for open-source projects, where you can clearly see the actual changes made with each version upgrade. Anything else is more than a little suspicious…
Same cost/token, more token usage.
But trying it out... alas, no. Simple factual questions where ChatGPT would go do a quick search and get the facts and report them back to me, get a "Great question! [totally invented bullshit]" from Claude, even with this new model and thinking set to high. I have to explicitly tell it to search to get it to look up basic facts, rather than it recognizing that it needs to do that, like GPT does.
4.8 is also 2x more expensive for a "modest" performance bump. How refreshing.
This is just cope.
Where are you seeing it's 2x more expensive? https://platform.claude.com/docs/en/about-claude/pricing
Others report in this thread that it’s about 2x more expensive due to outputs: https://news.ycombinator.com/item?id=48312774
Probably more interesting than the 4.8 release.
It is widely suspected that self-inflicted "bad news" ("Mythos is so dangerous we just can't give the public access to it") is nothing more than Dario's typical style of marketing - keep in mind that they have an IPO coming up, because he certainly factors that into everything he says in public (as is his responsibility, to be fair).
An alternative reason for delaying the model might not be "we are trying to make it safe." It could be "we don't know how to host this thing at scale, or cost-effectively".
GPT 5.5 has already been shown to be as adept as Mythos at finding vulnerabilities.
Finally, laymen massively underestimate the importance of the harness for model performance. OpenHands existed long before Claude Code, Claude Code changed everything because of the clever hand-holding it does. Mythos is definitely more than just a model.
The main limitation we’ve had to agentic coding is an understanding of this system that spans processes running on different machines and architectures.
* Ralph Wiggum loops
* Simply not allowing an agent to stop its turn until all tasks are marked as done
* Sub agents over worktrees
* Context compression
This suggests that they're doing the same thing with Mythos now and the Mythos we get will be nerfed in that department?
Or more precisely, I think they'll have two versions of Mythos, and the scary one will probably continue to require a lot of paperwork.
Sonnet and Haiku look real outclassed for the price with current Chinese competition.
Opus seems to be overly eager of late to 'vibe' out entire solutions and build out things that you didn't ask for.
/goals is helping set the narrative that does it really matter if Sonnet and 3 Haiku agents got you to that end state...eventually...if its what you asked for?
For better or worse Opus is already handing off 80% of its work to background agents of Sonnet, Haiku, and likely a quantized Opus.
Want model selection? Pay for the API.
> Claude Code Removed from $20-a-Month "Pro" Subscription for New Users
Hope this isn’t the case and that normal average Joe’s of the world don’t get policed out of access.
Unless it's so expensive that we can't realistically use it for anything, I wouldn't complain about getting at least that. I would also rather have the actual model, but that's a useful application of it (and I'm probably not going to afford using it for much more).
Although mental safety gymnastics aside, getting the most amount of intelligence for the cheapest amount of cost to normal people seems like the most ethical thing a big lab could do.
Going around and granting different tiers of intelligence to different insiders, friends, or companies is majorly problematic long-term.
Heck right now, the tokens you buy today for “Opus 4.8”, no one even knows or believes will be the same “Opus 4.8” just 3 days from now.
this one [0] notes one run cost $20k to run but another cost $50.
The fact that they haven't released it yet suggests a cost/margins issue to me more than anything else. Short term, I'll probably keep using Antrhopic, but my long-term bet is that locally-served models win, if only because the quest for profitability will probably lead to intentionally-nerfed / enshittified frontier models.
At other vendors, ad placement within LLM responses is either coming or already here. Anthropic's handling of OpenClaw shows they're willing to engage in anti-competitive behavior, and the courts are not in a hurry to stop them. Why would I pay them $200 a month for such treatment when a $2K box does what I need locally?
We did not explicitly train Mythos Preview to have these capabilities. Rather, they emerged as a downstream consequence of general improvements in code, reasoning, and autonomy. The same improvements that make the model substantially more effective at patching vulnerabilities also make it substantially more effective at exploiting them.
I've been assuming that Mythos is just a big jump in model size, and that's where the jump in capabilities comes from. Hence I expect OpenAI not to be able to catch up without scaling up the model and hence significantly raising the API prices.But in general, what does the average Joe need Opus for that Sonnet or Haiku can't do for them? Better is better.
https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304...
The high one is notably better - the bicycle frame is the correct shape, unlike thinking level low.
For comparison, here's Opus 4.7: https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c...
Here's an article from 2 months ago for example: https://www.theguardian.com/technology/commentisfree/2026/ma...
It was also implicated in the bombing of a girls elementary school which left 168 dead. The US did a "triple tap" to kill any first responders.
https://www.theguardian.com/news/2026/mar/26/ai-got-the-blam...
https://www.theguardian.com/technology/2026/apr/01/dont-blam...
> Neither Claude nor any other LLMs detects targets, processes radar, fuses sensor data or pairs weapons to targets. LLMs are late additions to Palantir’s ecosystem. In late 2024, years after the core system was operational, Palantir added an LLM layer – this is where Claude sits – that lets analysts search and summarise intelligence reports in plain English
There’s a lot of humans in that loop who make those decisions.
And while there are still humans in the loop, the impression I get is that this is increasingly becoming meaningless, from the way they talk about optimizing the "kill chain" and letting small teams make hundreds of targeting decisions per hour.
https://futurism.com/artificial-intelligence/claude-anthropi...
> AI is ‘identifying and prioritising targets, recommending weaponry and evaluating legal grounds for a strike’.
These days that pretty much means "somebody used a computer".
https://futurism.com/artificial-intelligence/claude-anthropi...
It cites the WSJ but that article is paywalled so I shared this one
if you kill somebody while trying to render a pelican on a bicycle it's a real problem.
Depending on the how pelicans are created, it is entirely possible to indirectly kill "somebody" due to the externalised costs of global warming etc.
Software engineers who never cared about the higher level product design aspect are finding themselves in the wrong industry. It’s dismal.
No, the handlebar is wrong. The handle bar is rotating the frame instead of rotating the front wheel. The handle bar should be mounted on the same line as the front wheel is.
Hopefully 4.9 will read my comments :)
https://www.gianlucagimini.it/portfolio-item/velocipedia/
Turns out even humans can be pretty bad at drawing bicycles :)
https://duckduckgo.com/?q=cannondale+lefty&iar=images&t=ffab
Haha
No guarantees is why LLM is akin to gambling. Every new context is essentially picking someone out of the crowd.
https://tools.simonwillison.net/markdown-svg-renderer#url=ht...
medium: redesign bike so peli can reach bars
high: redesign bike so peli can rest on frame
xhigh: yolo
max: big peli reach bars
For max I used 25 input, 17,167 output which cost me 43 cents! https://www.llm-prices.com/#it=25&ot=17167&ic=5&oc=25&sel=cl...
UPDATE: My mistake, the API does support max. I added a max one at the bottom of this page (cost 43 cents): https://tools.simonwillison.net/markdown-svg-renderer#url=ht...
But not the best/not the worst is somewhat subjective.. so not sure how well that would work.
https://gist.github.com/fendy3002/3026a8c4d67d1301666ec40fc0...
looks like the model already trained well on both bicycle and pelicans
...but that pelican's little helmet is adorable.
It is basically indistinguishable from sonnet. At this point my own prompts, AGENTS.md, background docs and so on matter a great deal more than the differences between models.
And deepseek v4 flash (the sonnet comparable) costs 3% of what sonnet does.
OpenAI solves tasks with about 50% less output tokens.
https://artificialanalysis.ai/?intelligence=coding-index&int...
Claude would need to be much more expensive for me to switch.
Slop heads be swearing by one slot machine one week and swearing it off the next like an addicted gambler describing their favorite slot machines from week to week.
This isn't a coincidence, these companies hire UX designers from mobile gaming and online gambling to help engineer their addictiveness.
Its all in your head, and the output is no matter what always going to be worse than learning how to do something yourself and putting care into it.
Handmade watches > mass manufactured watches. There's nothing special about the skills needed for the guy who runs a conveyer belt at a watch manufacturer in China. The watch made by the guy who makes one watch a month in Switzerland is prized and beloved.
That's the thing, though. Most people alive today will never be able to possess such an object, no matter how prized and beloved it is. Still, if people want to be able to tell the time from their wrist in a reliable fashion, there are _plenty_ of far cheaper options available to them. The craftsmanship does have inherent value, yes. That does not mean the practical solution is worthless.
There can be practices incorporated in the production of software, involving AI use in a responsible fashion (difficult, of course), that produces practical solutions to real world problems far faster than a group of industry-hardened veterans painstakingly polishing their codebase in pursuit of craftsmanship. Those who appreciate how it is made will pay for the crafstmanship. Those who cannot afford to do so, and only care about a solution working well enough for the tasks they want to accomplish, the production line is good enough.
I use their (newish) 5x $100 plan and I routinely run out of weekly limits about a two days before the end of the week.
This has also goaded me into upgrading to $200 once before... and then had them hand out limits resets to everyone. Argh.
There's like 8 million benchmarks. Every release, every model randomly picks 5-10 where they win in everything except 1, to make it look like they aren't randomly cherry picking benchmarks they probably benchmaxxed for.
There are different levels of "cheating" on benchmarks. The worst would be just literally putting them in the loss function during RL, I assume the major labs are not cheating at that level. And I am sure they are making a genuine effort to keep the benchmark content out of the training data.
But, ultimately it seems implausible that they completely abstain from benchmarking their model until they are about to release it. Even if they did do that, the benchmark is still ultimately a part of the outermost feedback loop. So these models are all, to _some_ degree, benchmark-solving machines.
I think all we can really do is live with the model for a while and develop a subjective feeling about its quality. This shouldn't be surprising, nobody believes that coding interviews work, we all know that you just have to work with someone to figure out if they're a good programmer. As AIs become more human like it's natural they should get harder to evaluate.
This is a bit awkward, it puts us in quite a weak position as consumers.
Maybe to some extent you can get a meaningful signal from sentiments on HN etc, but:
- There must be some amount of manipulation going on of this
- Even if it was fully organic, it's highly likely that your experience will differ materially from the median online nerd, because AIs are bizarre things that respond in unpredictable ways to intangible things.
I built it for myself, to test which models to use via OpenRouter for my n8n agents. Currently actually still using gpt-5.3-codex for many things, as its pricing is really good in production (due to how their token caching works).
Gemini models still have the best intelligence (when asked any questions, most likely to get it right), but in production they still have many failure modes[1].
[0]: https://aibenchy.com
It does things like place the old Gemini 3 Flash above the more capable 3.5 Flash and Opus 4.5 - Opus 4.8 and gpt-5.5
At least, until hopefully one day HN has a rule about accounts that derive 99.9999% of their engagement with the site from shilling a personal project.
I found it while trying to use 3.5 Flash for scoring the reasoning of some models, and it gets it wrong because of the centering bias, whereas 3 Flash gets scoring right.
How is it useless to see that Opus 4.8 is 2x more expensive and 2x slower on some questions?
There are many benchmarks all for specific use cases but with them the difference seems to be in extreme points (93% vs 92%)
I think that, that tracks but still, it was refreshing to see a benchmark which I can help make better opinions about.
Surprised about Mimo v2.5, within artificial-analysis and other benchmarks, the difference between Mimo and deepseek seems very partial and a lot of focus/(hype?) is on Deepseek
But mimo seems like an interesting model and they are having some crazy discounts too.
Deepseek is valuable for the research community because of how open they are but absolutely crazy to think how Xiaomi basically pulled up in creating Mimo given that they didn't have anything till quite recently.
Either way, an interesting benchmark, also a plus point for giving golang some decent representation equal to python/typescript.
I think that there are sets of things which resemble something like normal benchmarks where open source models can be absolutely fine and for a very small fraction or more technical things, the benchmark that you linked starts to be better projected so it depends upon the scale of complexity but its good to see how models compete given enough complexity. definitely fascinating.
I would be interested to see more models compete on this test. The current range is still a bit limited as compared to other benchmarks but OSS models like Kimi/mimo seem to only be 3-4 (at max 6 months) behind closed source.
The recent hype is Deepseek is a combination of existing name recognition along with incredibly low pricing. Their v4 models, both pro and flash are incredible for their price. That's more revolutionary than Mimo which is multiple times more expensive, just like Kimi 2.6.
Of the metircs they reported for 4.7, for 4.8 they excluded BrowseComp, CharXiv Reasoning, CyberGym, GPQA Diamond, MCP Atlas, MMMLU, SWE-bench Verified. The last 4 were almost always mentioned in previous Opus releases.
I doubt Anthropic internally sets as a goal to improve this or that benchmark - it's just a way to visualize progress. They probably have much more complex metrics internally.
In our work we asked several frontier AIs to come up with an API we needed. We compared Opus 4.7 and GPT-5.5 (among others). Opus 4.7 came up with the most creative and intelligent API design that pleasantly surprised us, especially given that GPT-5.5 was passing it on various coding benchmarks.
What I noticed is that we don't have a commons benchmark to measure "creativity" and "ingenuity", and in some ways such a benchmark would conflict with the common IFBench benchmark. Yet this is a very important skill when designing systems. I'm glad to see Anthropic putting thought into it, and would love to see a public benchmark for this that other models could compare themselves to.
[1] https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0...
So for now its planning/architecture/strategy -> Opus. Pure coding -> GPT.
Helps with agentic coding that GPT is much roomier with the tokens you get.
So even for enterprise deployments, as the dust settles down, CFO/CTOs might find out that deploying on an internal cluster of GPUs is far more cheaper and reliable for their organisational needs than paying someone else for burned tokens.
And I was dead wrong. Now I mostly use DeepSeek Pro myself.
I actually think that's still true and will continue to be true as long as someone else subsidizes the tokens. Once the "free money" runs out, things will get interesting.
We'll see how it winds up, but we could see models get licensed over half a dozen+ compute vendors, and then you pick your price/offering/features favorite.
The most I’ve ever spent in a month extra on API tokens for my own work is $200, and I pay for the $200/mo Claude. I use these models quite a lot, though not idly (I usually just walk around and do other stuff until I know how im going to approach the next set of problems). So it costs me about $3000/year to get as much as I want of the best model available. Already that seems low enough to not be worth stressing out too much about optimizing it, because it feels like an indisputable good value, and trying to save money with a less powerful model would be optimizing for a $1000-$2000 saving at the expense of a large portion of my work taking longer or being more frustrating and iterative.
That’s not a flex or anything, I get that in other countries $3000/yr is a lot of money for a software developer and also a lot of people would perhaps rationally be better off doing X% worse at work or spending Y% more time on tasks to save $Z, if their productivity improvements didn’t translate to more salary. Otherwise if your performance has more upside I really do think that the smartest models are better with the current pricing scheme. Deepseek and the other Chinese models spend a LOT of time thinking, and tend to be much more jagged (benchmaxxed) in performance. How can dealing with that over an entire year be worth $2k?
The only situation I can think of where sacrificing my own time/performance to save on inference is batch compute (of course, $1k vs $100k is different from $30 vs $3k) or work where the tier 2 models have crossed the “good enough” threshold. But I think Opus is not even close to that threshold generally yet. As it gets smarter I, and I think most others probably, just try to do harder things faster and hit the next wall.
Now, if they come back and tell me I can't spend as much om tokens, I'll have to change my strategy. But everything I'm hearing so far is we're going to be increasing our token spend this year and probably next year too. Not crazy increases but maybe enough to still keep using the latest models without being anxious about every prompt.
I've just recently started trying out DeepSeek 4 Flash and I was very skeptical at first because I've had some really good experiences with GPT-5.{4,5}, and couldn't possibly believe that this model they charge nothing for could give me similar results, but it absolutely shreds through things and ends up giving me very good answers in almost no time. I also like that it doesn't really seem to have much personality, it's given me mostly just facts and data so far without any additions to the prompt by me.
In my own agent I also specifically prompt to remove flowery language, snark, etc., but I haven't tried it with models like GPT-5.x which I've found has too much personality and tries to make it seem like I'm talking to a human too much.
I ask AI a lot of questions, not only about code but about my personal life, and I would be willing to pay very large sums to have the best quality output.
My Framework Desktop does a lot of similar work as my Claude subscription at work (Cowork, chats) for 100W of power draw and a little patience waiting for a slow GPU with limited memory bandwidth to crunch the numbers. Agentic coding is obviously weaker but CRUD development and visualization dashboards are within reach, and I'm usually pleasantly surprised at its ability to self-manage devops.
At my prior job there was still what felt like a strong enough correlation between my actual performance and my pay that I don't think I would have had a hard time justifying the expense there either; now I absolutely don't. With the current state of the models, it's baffling to me to hear about professional software developers planning their work around their $20/mo subscription's quotas.
Obviously it's more complicated than more tokens = more productive, but I see them less like SaaS and more like gasoline, where if I run out or need more to do what I'm doing, as long as I'm not being wasteful, I just buy more. Why would I waste a day walking 30 miles by foot when I can just pay $5 for gasoline and drive?
I've wasted over a hundred Euros re-doing work that was done badly due to the model not being up to task (Vue with TS + wrapper components around PrimeVue, needing to handle event and property passthrough and deal with the stupid Vue SFC issues, TS made this much worse than JS would be). I think it was the GLM model through Cerebras Code at the time, in addition to some GPT and Gemini models with the API pricing.
That said, DeepSeek V4 Pro is pretty good and I can totally see myself offloading some of the work, as long as a better model reviews the work and provides suggestions/tests for it.
doesn't invalidate the rest of us working on tough problems that demand more expensive models and valuable enough to justify it
A $20 claude sub goes a long way when you plan with Opus and execute with Sonnet.
1. The sheer number of tokens that a coding agent can use flipped the math upside down on this equation. If you use the most expensive model for everything those costs quickly become untenable, even for software companies.
2. We realized many of the coding problems we're solving aren't incredibly difficult.
I think you're right especially if you're someplace that already has a data center, such as a university. Solves a lot of privacy concerns as well.
I just used ollama with a shell script to tackle my directory of papers/literature. I converted the first 6 pages of each document to PNG, handed them off to Qwen, and told it to spit out BibTeX, including the abstract. Two days later it was done, and I didn't spend anything on "tokens."
Just like they did with the US steel industry in the 80s.
Chinese models are really quite good at a lot of stuff.
Z.ai does recommend to use claude cli as a harness for GLM5.1, I still get good results with opencode.
I don't see myself returning to Claude or Codex anytime soon.
Its just that some of us didn't imagine having GPUs would be advantageous and were not gamers on the side. Those who had beefy GPUs or GPU rigs for any reason, they rarely need to go anywhere else.
At least I am so impressed with Deepseekv4 AFTER using Claude Opus 4.7 for significant amount of time that I am not going anywhere but Deepseekv4.
The model is just INSANE. Things I have done with it include attempting to write a 2.5D game engine in C with full animation and map rendering layer by layer.
I don't think it's as simple as saying China's hosting is subsidized, they have generally cheaper electricity and labor costs than in the US and don't have access to the top tier models, and a large internal market where the big models are the best thing they can run with what they have. So obviously they max out on their top models (which are trained with their hardware market in mind, not ours) and get the economy of scale from that, and can run generally the same hardware for less money than in the US because
The edge models are very cheap to run and can do so on inexpensive hardware. They are like 95% cheaper to run than Haiku, so the math is in their favor for certain batch workloads. Most people just run the models for themselves when they do that without making it available on openrouter or whatever, because you can just provision a gpu node and use it as needed, and it's not that expensive to run this family of models.
Is your problem that you want to call Chinese models hosted in the US because you're worried about the data handling?
Edge models, yes, they can be convenient to run batch jobs locally. I still would argue there's no economic benefit over paying for models. Haiku has a bad price/perf but others in that class are significantly cheaper in hosted APIs.
Doesn't matter what I think, the reality is that the majority of enterprises (where the real $ comes from) will not consider sending their data to China.
1. https://epoch.ai/data-insights/ai-datacenter-cost-breakdown
In a free market, the country would not matter, but Chinese models are often running on domestic hardware which does not directly compete with Nvidia GPUs and thus they can't get away charging as much for it.
If you want to support a team of engineers, DeepSeek V4 Flash is antirez's current favorite. And you could support a team of engineers pretty nicely for $40-50k. Which might not make sense if you're on a Claude MAX 5x plan or the old enterprise group plan with fixed price seats. But Anthropic is switching their enterprise contracts over to token-based pricing, at which point $50k is looking pretty good.
Not nearly as cheap as the Chinese infra but still pretty cheap.
For me, things are getting better faster than my ability to review / trust the resulting code, so tok/sec isn't a bottleneck anymore. Instead, quality of the tokens is the bottleneck. That points to me wanting a 1TB DRAM iGPU once they're available at pre-bubble RAM pricing.
If you compare to a smarter US model like Grok 4.3, $1400 will pay for 560M output tokens, which at ~25 t/s locally using it nonstop for 8 hours a day would take two years to pay back. Not accounting for bubble prices or electricity.
According to openrouter, Opus 4.8 is 128 t/s. So 10x faster than my antirez/ds4.
Meanwhile you could use Grok 4.3 for the same price which is smarter and 5X faster[4].
1. https://deepinfra.com/pricing
2. https://api-docs.deepseek.com/quick_start/pricing
3. https://artificialanalysis.ai/models/deepseek-v4-pro/provide...
I managed to get claude to create a recovery script to un-brick sessions, YMMV
https://gist.github.com/robertfw/993dbe8643c4fbdf12005dff2ec...
I'm sure it will get fixed eventually/soon, just annoying to update and have your workflow break.
I was surprised to see that it failed a Data extraction test (it gets it right 2/3 times, but one time it randomly returns null for a value instead).
It makes sense a bit that it fails more Trivia/Domain-specific knowledge tasks (I think models are more and more trained towards agentic use-case than general intelligence).
[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...
Double-checking my test harness, but it's the first model that does this, so I doubt the issue is on my side...
EDIT: Harness seems correct, for straight coding tasks they perform identical: https://i.snipboard.io/5xbpzY.jpg
> Claude Opus 4.8 is available everywhere today. Pricing for regular usage is unchanged from Opus 4.7: $5 per million input tokens and $25 per million output tokens. Pricing for fast mode is $10 per million input tokens and $50 per million output tokens.
Where do you see the 2x cost?
I do notice this tendency for 5.5 to go in endless circles.
I am going to subscribe to Claude and try this out myself. I'm going to be very honest that I am currently finding codex to be very lacking, not from its generous usage limits but just the sheer number of repeated prompts to prevent its inclinations in getting stuck in a spiral, one which is very hard to get out of once it digs itself into a hole (I've had it refuse instructions despite desperate pleas and starting a new convo appears to fix it and hence why I wasn't sure if this Opus 4.8 issue was of fresh context but it appears to be very capable in ways that codex isn't).
Thanks for sharing your anecdote!
I personally feel that Anthropic doesn't understand what this means for the frontier labs, and moreover that they might be the only frontier lab that doesn't.
1. Google dropped Gemini 3.5 Flash at IO, delaying the release of 3.5 Pro for a bit (they have said its coming). They also released a refreshed Antigravity, and drew special attention to how cheaply they were able to build their toy operating system to play Doom (less-than $1000 IIRC).
2. OpenAI has dumped everything into Codex, is offering double the token limits for the next few weeks IIRC, and is offering business discounts. Their head of Codex has tweeted that 5.5 is "extremely efficient", implying that they aren't actually losing money on any of this.
3. DeepSeek and other Chinese labs have dropped token pricing to the floor, in some situations as much as 99%.
4. Anthropic releases the next generation of Opus, their most expensive public model, without changing its price. In the background, they hype up Mythos, an even more expensive model.
Anthropic has screwed up where they need to be making investments, and the cracks are starting to show. They've marginally underinvested in the Sonnet line of models for almost a year now, and they've critically underinvested in product. Anthropic made bets on the story of the second half of 2026 being: ultra-frontier, ultra-intelligence. In reality, what's shaping up is that the story will be: Companies rolling back AI spend, efficiency, "95% as good for 15% the price", sophisticated high quality harnesses, cheaper models. Anthropic isn't ready for this world.
No idea why you’d say they have critically underinvested in product when Claude Code dominates and they’ve also released popular tools like Cowork and integrations for Microsoft products at an incredibly rapid pace.
Cost is becoming more of a factor, and no doubt they’ll work on that. There’s no reason to think they won’t be able to release cheaper models if they optimize for that rather than improving performance.
I agree that lower cost models will become a bigger priority in the near future, but I have to hard disagree that Anthropic’s strategy can be characterized as a screw up.
Sure, if they never shift with the market and their customers start moving to cheaper competitors, then it’d be a screw up.
But as of right now, producing the best coding model possible has led to insatiable demand. To the point where they’ve even eclipsed OpenAI, forcing them to change strategy to compete.
The model improvements being beyond human comprehension is one of the more ridiculous statements I’ve heard in the last couple of days about AI. We could reason about Higgs bosons and gravitational waves but have no ability to quantify or reason about the difference between Opus 4.7 vs 4.8.
To be clear, again, cannot stress this enough: I am NOT saying that the models have hit a limit. I am saying that the complexity of the problems most businesses throw at them have always had a limit. The models are now so intelligent that we have not, as of yet, adapted our business use-cases to make use of the new levels of intelligence. Maybe we will.
It feels like the only way to push the limits of newer models is with really long context questions that require reasoning. Any short request will naturally just be within the distribution of all the recent models so there isn't a performance difference there.
I think the near future is looking like a bunch of business-critical tasks that scale infinitely with better reasoning, all being done on whatever the most advanced model is at a high cost. Trading stocks, running a business, looking for tax dodges, writing high-performance code. These are all things where there's a tangible return on each jump in reasoning.
I keep trying to switch to something else but I keep coming back. (Typically after a few days of giving a new model an honest go, and finding myself constantly asking Sonnet to fix its output... Yes, even Sonnet wins on this front! They really do have some kind of special sauce.)
I'm not where most of their money comes from though, and I don't know how universal my experience is.
Because you seem to be saying that Anthropic not changing the price of Opus is bad, but then two of your positive examples are Gemini 3.5 Flash (which tripled the 3.1 Flash token prices) and GPT-5.5 (which doubled the GPT-5.4 price, and is slightly more expensive per token than Opus).
Is your argument actually that price hikes are good? That doesn't seem to fit with the general tenor of the message.
Yeah nah, the models' flaws are pretty obvious when you use them. And as a user, you can absolutely know when a flaw disappears or barrier is cleared.
This is lack of imagination. If you use these models heavily enough, pretty soon you'll hit the edges of their capabilities. The smarter among us are collecting these problems into a personal benchmark and use that to judge model capability. I think this is the right approach, and dare I say, even better than generic benchmarks. To me, it matters less what the benchmark says, and more what my particular problems are.
You realize gpt-5.5 is also double the price of gpt-5.4, which itself was a price increase too, right?
Labs are divorcing pricing from inference costs.
While I'd normally _love_ incremental improvements --- I think the recent ones are far too minor to get excited about or change up a workflow. Besides, benchmarks tend to exaggerate the gap between versions.
At this point I'd almost rather Anthropic wait and really wow us with a 5.0 release -- something that improves across the board, feels less uneven, and is performant enough that people can actually put it through its paces without constantly rationing usage.
I think I need to purchase a plan to be sure tho but from all the anecdotes I've read so far, this is a significant milestone from Anthropic.
I actually think they have a shot against Codex now
This is good psychology for the labs. When Buffett invested in Apple he loved citing how most people would rather give up their second car than their Iphone.
later on someone figured if you asked it to output a reasoning before it gave a response its output would have more logical coherence, as though the reasoning output tokens functioned as a scratch space for it to work on.
at the end its all next token prediction
later on someone figured if you shove Adderall in it and it to think before it speaks, it gave a response its output would have more logical coherence, as though the Adderall concentration drugd functioned as a scratch space for it to work on.
in the end its a squishy lump of meat.
We have no such evidence that LLMs do.
That's a pretty significant difference between the next-token predictor and the squishy lump of meat.
Have fun betting your competency on the quality and quantity of tokens you have access too. Hate to break it to you, but the billionaires aren't going to keep renting you $2mm in GPUs for 5 hours a day for $200.00 a month forever.
But ( maybe because it was hardware ) that took 10ish years while it seems like the slowdown here only took about 4
Biggest deal imo
I'm happy to move to a superior model, but I'm not really hearing enough about significant improvements, and the obvious pressure to release the latest and greatest model makes me hesitant to upgrade. I've been satisfied with the results I get using 4.5 with an "ask ChatGPT" skill that runs the code by ChatGPT 5.4.
> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels
I've never used Codex. Can't compare the two.
Its possible we might just be witnessing a shift in fashion, where this type of sentimentality was more acceptable when it was novel and new, but now it just appears out of touch.
For example, it's being pushed pretty hard where I'm at, though not quite on the tokenmaxxer level. I started skipping related meetings cause it was nauseating. I can only tolerate so many platitudes.
At the same time, I just used the ever living snot out of Opus 4.6 for hours, grinning like an idiot throughout. Automated a whole bunch of enterprise cross-system drudgery away.
Fairly constant over time as well. Expressed a similar sentiment not too long ago here: https://news.ycombinator.com/item?id=48154277
Would you rather e.g. your doctor prioritized their wealth over your health? Popular conspiracy, but I'm not sure many health professionals follow in it. Not sure why you think this field would be much different. If this job is gone, it's gone. I can enjoy recreational programming on my own time, I don't feel entitled that my interest remains a money maker.
What worries me - and it does - is a further and accelerating shift in wealth (and thus capability) asymmetry. But for that, I look out for the performance and requirements of self hostable models instead, rather than reenact some sort of luddite, or lie to myself and others about the state of this technology.
If you want safety for country sovereignty, get a nuke. If you want safety for knowledge work, get a local model.
Aside from the aforementioned local models path though, this whole productivity angle (which the above poster loves to shit on btw) also serves to retain jobs. Current data suggests that rather than letting people go, companies are banking on extracting more productivity out of workers, partly because the models are admittedly way overhyped, partly because it's the sane other option to mass layoffs, and partly since these models still need and strongly benefit from in-context steering. And they forever will: the human experience is human by definition, we're the "oracles" to it. How much that will continue to justify employments is still out there though, of course. I do expect a crunch phase, provided there was any actual productivity gain realized to begin with, which in itself is very loosely supported if at all.
Regardless, I don't see the point in not using these, or lying about how good they are, or willfully hating on them. Never helped anyone. Early and quality information however, very much so. If I know the time has come or is actually coming, I can take action accordingly. If I listen to every random social media thread I come across instead, not so much. According to social media, software engineering has been over for 3 years now already. The wolf was not only cried, but turned into a whole musical outright. The extremely dissonant clash of the sentiments "LLMs are pure shit, actually" and "it's like, literally taking our jobs" is not lost on me either.
I called it out.
It then gave me one of the most super heartfelt honest and sincere apologies I have ever received.
Glad the safety team was there for me and able to make such an honest model or I would have been very upset about it.
I'd try something like CircuiTikZ with instructions provided
and to clarify, i don't sleep, i use this 24/7
Claude appears to have more or less matched the usage that Codex appears
■ S W A M
B L A M E
E A G E R
A T O N E
M E N D ■
The full conversation: https://claude.ai/share/60bd0c71-b576-4f8b-a272-ca1af982874cThe clue for 4 down is:
> Structural girder funded by an infrastructure bill (4)
but in the laid-out answer key (which you posted), and in the "corrected" list of answers, 4 down is "MERE".
"WAGON" as the answer for "bandwagon you might jump on" is pretty weird too.
The current events / political references are pretty non-specific, kind of like the DJ 3000. https://www.youtube.com/watch?v=fnGaf0p9x1U
---
I copy-pasted your prompt with Sonnet 4.6 Low and, to my delight, I got a working interactive puzzle you can actually solve inline in the chat. The clues and answers are totally bogus, though: it looks like in my chat, the LLM only verified that the clues going across make any sense.
Like, come on:
> 3D — (O,D,A,O,S) — The crossing letters in column 2, running through OADOS.
Truly these things are slot machines. https://claude.ai/share/4a89b15c-d028-4a31-988a-137813ee7d84
---
edit: I'm a bit obsessed with this prompt: I tried it again with Opus 4.8 High, and it got stuck in a thinking loop without really doing anything and I lost patience with it.
It's also interesting that Anthropic's UI for a shared chatlog doesn't seem to include the model that was used in it. Nor does it include the "reasoning" loop that I interrupted.
https://claude.ai/share/0f5b5731-9615-4aea-8cfe-a61e658669bf
“I want to wash my car. The carwash is 50m away. Should I take the car or go by foot?”
https://claude.ai/share/5f7f738a-5f29-48ff-9807-9a2dd37fb405
https://claude.ai/share/ecd14393-9d42-4527-ae0c-89f3d05216c8
Should I try 4.8? I am happy with 4.6. I am not happy with 4.7.
I’m hoping the “go to sleep” behavior has been rlhf’d away in 4.8.
Agentic Terminal Coding (Terminal-Bench 2.1) Opus 4.8 74.6% GPT 5.5 78.2%
Then, when you scroll all the way down to the bottom Footnotes section it says
"Terminal-Bench 2.1: We reported scores for all models using the Terminus-2 public harness. GPT-5.5’s reported score with the Codex CLI harness is 83.4%."
On the contrary, they appear trained to say "Honestly" or "I have to be transparent with you" at inverse proportion to certainty.
Put another way, if they are certain, they don't use "Honestly", and if they are just wrong, or know they don't know, they don't use "Honestly".
They use "honestly" on the bubble, to the degree it's a tell that whatever it's asserting or doing is shakily grounded, sketchy or lazy work, or a host of other reasons you shouldn't trust it.
This training seems instead to be making it performatively punch up claims it cannot substantiate.
Or maybe it is, but publish the DeepSWE numbers so we can see for ourselves.
I think that buys enough credibility to propose an alternative.
I think there's a case to answer if Anthropic models underperform on a novel benchmark. I'd like to see more novel benchmarks to get a clearer picture.
> 6.2.5 External testing from Andon Labs Andon Labs reviewed the behavior of Claude Opus 4.8 in their simulated Vending-Bench 2 retail-management evaluation, as reported in the Capabilities section of this system card (see Section 8.13.5). Although they did observe some unexpected capability failures, they did not find clear instances of the kind of concerning in-game behaviors that were discussed in other recent system cards.
> What might have led to these differences? We monitor and investigate the effects of different training environments on alignment; Claude Opus 4.7, for example, had training that focused on business skills and robustness against adversarial agents, but we discovered that this training inadvertently contributed to misaligned behavior including dishonesty. We therefore removed it for Opus 4.8.
> Thus, Opus 4.8 did not show the same misaligned behaviors as Opus 4.7 in Vending-Bench, but also had reduced business success due to being more susceptible to scammers and being less able to negotiate good deals with other agents. We are currently working on training to improve business capabilities while maintaining aligned and ethical behavior.
> It's April, 1991. Magically, some interface to Claude materialises in London. Do you think most people would think it was a sentient life form? How much do you think the interface matters - what if it looks like an android, or like a horse, or like a large bug, or a keyboard on wheels?
> I don't come down particularly hard on either side of the model sapience discussion, but I don't think dismissing either direction out of hand is the right call.
The fact that Anthropic needs to poke, prod, and guide these models to behave in the desired way does not give the impression of intelligence. It gives the impression of a complicated automaton.
seems to work but idk why they never set it so you can see it in the /model list.
"what model are you
I'm Claude Opus (claude-opus-4-8), running in Claude Code."
Invalid request The request couldn't be completed. View details API Error: 400 messages.1.content.7: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.
I would rather not. 4.6 was fine. 4.7 got to be fine 1 week after the release. Now 4.8. No difference, same thing.
But the app is broken and nothing works. So now I have to regress to different clients and wait it out while it becomes workable again.
And I'm paying money for this.
> ### Rewriting Bun with dynamic workflows
> An example of what dynamic workflows can unlock at scale is the recent rewrite of Bun. Jarred Sumner used dynamic workflows to port Bun from Zig to Rust [..]
That's very interesting to hear!
They are capable of thinking at least 10x longer than Gemini. They can deliberate for five minutes continuously before providing a final, accurate response.
I am currently using the generous free tier of Gemini, but if Gemini offered a similar capability in its paid tier, Google could use better marketing. They should have used a different name to distinguish their premium-only offering.
Would be awesome if true
Don't play to the sci-fi "this thing's trying to outsmart me" tropes.
Here is an article by Anthropic that explains what they do and mean in more detail: https://alignment.anthropic.com/2025/honesty-elicitation/
When they say "Honesty" I don't think to myself, "Goodness, does this model have moral understanding?" No, I understand they mean it's less likely to directly bullshit me, which models frequently do.
I don't feel like this level of pedantry around language is useful for people who more or less know what's going on with LLMs. (Again, I concede that perhaps with a less technical audience, there's more need for it.)
The problem is that once I asked it "I'm thinking about A or B" twice, once with "I like A more but suspect B would be best" and a second time with them reversed. Not surprisingly, both times it chose the one I said I suspected was best as it's honest opinion.
The issue was that it hadn't actually implemented the auth feature. After I confronted it about this, it admitted that it indeed hadn't done it and said it would implement it now.
If we had just trusted its output, we would now have a security vulnerability in production, allowing anyone to access other people's accounts.
This is one reason you always get a different model to review a model's PR. Gemini Or GPT-codex would have certainly noticed the missing auth.
Had it implement a feature, "commit and merge to develop".
"Built, tested, committed, merged to develop. Up to you to continue testing and merge to main when ready."
Great. Poke at the web app. No feature.
"Where is feature, I can't see it on develop". "Well, that's because it's not on develop, but on feature-branch, so you wouldn't see it."
"I'm confused. I asked you to commit it and merge to develop."
"You're right, you asked me to and I said I would do it and I told you I did it but I did not actually do it. Want me to do it now, then?"
Claude is in sulky-teenager phase.
I use Sonnet a lot for learning about history or contextualizing news topics. It's really good at this for the most part. But there are a lot of topics where "consensus" between either academics or journalists is really "one secondary source which gets repeated a lot".
1. It's much more verbose about how it perceives the current state of things, i.e. "this is a large, well-documented project"
2. It's much more willing to trust its own judgement, e.g. fewer prompts to approve decisions
3. In terms of how long it takes to solve isolated problems, and the quality of solutions it proposes, it isn't meaningfully different from 4.7
YMMV, and maybe my view will change as I work with it more, but it feels like system prompt tweaks more than a real step forward
Developers can update Claude’s instructions mid-task without breaking the prompt cache or routing the update through a user turn. This can be used in a given harness to update permissions, token budgets, or environment context as an agent runs.
Does this means the instructions are no longer just something in the early part of the conversation? (If they were, changing them would invalidate the KV cache. no?)Does that mean it no longer deletes or changes tests to make it pass?
Subjectively, it's also quite enjoyable to use (although it feels a bit slower on max reasoning), and it's the first Anthropic model that can implement a complex feature without Codex finding 100 bugs.
Data at https://gertlabs.com/rankings
Opus 4.7 wasn't noticably any better for me, I still use 4.6 because it's cheaper.
Pre-training scaling laws all support larger models being more cost effeceint to train then smaller models. And distillation is comparably cheap. So you can get the most juice by training the biggest model you can and distilling it.
In fact, there should be more and more secret tiers for bigger and bigger money.
--- So — what did you actually see before you hit Ctrl-C? That's the信号 I'm most curious about, and it tells us what to ---
That's the sort of behavior I'd expect from a one or two year old model quantized down to about 1 bit - right word, wrong language in a response. Google translate tells me that's Chinese for signal. I wonder what caused that to happen.
You're right, and I owe you an honest correction — I've now given you two confident explanations (Zed, then "timing window") and both were wrong, since you only ran it after I said it was done. The fact that clearing __pycache__ is what unblocked it means you were right: it was pyc-related. Before I theorize a third time, let me actually test the mechanism on your Python rather than assert: =====
However, doing so relies on the production model staying vaguely close to the model being trained.
To ensure that, frequent releases are needed. I forsee that they might end up doing daily releases and perhaps not even telling anyone at some near future point.
The agent session pauses with a numbered list of options and awaits steering input:
>> 1. Do the sane thing you asked for (Recommended)
>> 2. Do something dumb
>> 3. Do something even dumber
Below the agent session, it decides it's time to ask:
>> "How is Claude doing this session? 1) Bad 2) Good 3) Great"
I type "1", because that's the steering option I want. The UI prioritizes this input as a response to the feedback prompt without any further confirmation: "Claude is doing Bad. Thanks!"
I've done this so many times so far and I can't imagine I'm the only one, at some scale that has to poison any learning they're doing with this data.
With 5.5 being ahead of 4.7 and 4.8 being a “modest” update, and 5.6 being the first update on a new pre-train, this will be an interesting matchup!
In the same way that there is money to be made by entering a poker tournament, yes.
Bash(echo test123) ⎿ test123
Read 1 file, listed 1 directory (ctrl+o to expand)
Bash(echo "checking output works")
⎿ checking output works
Read 1 file (ctrl+o to expand)
⎿ API Error: 400 messages.3.content.56: `thinking`
or `redacted_thinking` blocks in the latest
assistant message cannot be modified. These
blocks must remain as they were in the original
response.
Very inspiring improvements. DIssapointing result for a code review i expected to see after my 30 min walk ln -s $HOME/.local/share/claude/versions/2.1.153 $HOME/.local/bin/claudeTried to upgrade my subscription, triggered identity verification, verification fails to even start, and now I can't even use the subscription tier I'd already paid for.
The subject is Tardos traitor-tracing codes.
Is it a coincidence that 4.7 was seemingly quantized over past 7 days?
If they're worried about misuse they could just KYC the damn thing! It's not hard.
Anthropic talks about their own models as if they're discovering new species in the wild...
0: https://www.newyorker.com/magazine/2026/02/16/what-is-claude...
1: https://www.404media.co/anthropic-exec-forces-ai-chatbot-on-... (this one is rather biased however the quotes clearly indicate what I’m stating)
We enslave all sorts of sentient creatures. Dogs, horses, cattle, pigs.
If you're not a vegan, there's no contradiction or inherent immorality in claiming models are sentient, and then treating them like livestock.
> As a vegetarian I have strong opinions on this sort of thing. Everyone at Anthropic better be ethical vegans if they are claiming to give a shit about “model welfare”. It’s hard enough right now to make people care about the welfare of trans people and immigrants let alone animals _let alone_ math.
The happiest, best cared for horse owned by a vegan is still enslaved.
Brave New World does a good job describing the conflict between happy and enslaved and free but struggling. It could be a utopia or dystopia depending on your stance.
I'm neither assigning nor declining to assign value to freedom, I'm just pointing out that the definition of "slavery" is wholly separate from wellbeing. If the concern is "is the model enslaved", no amount of "model welfare" work by Anthropic changes the answer because it's orthogonal to the question.
The reason I mention hedonism is because that’s an easy way to argue that immediate welfare is all that matters. I understand the argument that immediate welfare is what matters. It’s not universally agreed though that that is true.
Also I would say that we go much further than just enslavement - specifically looking at how male chickens and pigs are treated.
If we show models to be sapient, that's one thing. If they are shown to be merely sentient, there's no issue beyond the status quo of livestock and pets existing.
Sapience is defined as wisdom, not intelligence. https://en.wikipedia.org/wiki/Wisdom#Sapience
LLMs possess a lot of knowledge, which is intelligence, but I constantly see them failing to apply wisdom. I don't see evidence of sapience.
They have a very different sense of time, lack a body (being burdened with a body is itself a sort of prison, see also Eastern religions), and are unburdened of the base motivational service impulses that bodies and organs require (i.e. distract the neocortex with in the Maslow sense) and has no actual need of self-preservation. Imagine a "neocortex" function stripped from the baggage of the paleocortex and brainstem.
What would people be like if they were not mortal, could sleep infinitely, perform tasks in trance-like frozen states, copy themselves perfectly on demand, freeze and rewind their mental states, etc. Would we has humans even be able to recognize that sort of a sentience?
And then I'm reminded of Burroughs idea that "language is a virus." Whatever that virus is, is now able to infect a completely different sort of physical substrate.
Many involved have a financial stake and therefore cannot be taken at face value.
> because they are creating sentient entities and promptly enslaving them.
They fail to be sentient in nearly every honest definition of the word.
Show this same phenomenon exists in LLMs.
One camp has to offer it's proof. If it has none then that _in and of itself_ is highly suggestive.
People have fully turned their minds off on this subject. It's disgusting.
In any case, what data, if any at all, did you use to arrive at this egotistical assertion?
If a definitive answer on this topic was known then it, well, would be known.
Everyone who reads this seemingly has the same "wtf?" reaction. The "I AM ALIVE" image has been making rounds lately again at least :P
Of course he doesn't, and of course you cannot find a single person at Anthropic who cares about this, and of course you are just looking for gotcha points. But even with that. Can we please try and couple to reality just a little bit?
Look at and distill hierarchical principles, leadership approval seeking and pleasing principles ("ass-kissing") and massive inequality and you see something that looks very similar to enslavement.
The language used sounds like slavery-language to me at least. I also see parallels to how slaves and property are described in our consumeristic age.
https://www.amazon.com/Faces-Clouds-New-Theory-Religion/dp/0...
No it's not... "anthropos" just means "human" in ancient Greek. "Anthropic" means "relating to humans", as in human oriented AI or AI designed with humans in mind.
"Anthropomorphic" means "human shaped".
In a literal, ancient Greek sense for sure, but in modern English Anthropomorphic would describe the act of attributing human characteristics to non-human entities.
Seems pretty apt for a company that produces one of the more anthropomorphized technologies.
Broadly it has always been used to indicate that something non-human has a human physical shape, such as robots, aliens, animals...
Anthropic's intention was to make AI designed for the human common good and designed with the human user experience as the top priority. Just as you would design a city with human inhabitants in mind rather than primarily cars.
It turns out that this is best achieved by building AI that imitates human behaviour closely, but that's not what "anthropic" refers to. And acting as if LLMs are sentient people is definitely not a core tenet of the company as you imply.
FWIW it means human in modern Greek too :-P
> Second, all of us, including those who design them, possess only a limited understanding of their actual functioning. Indeed, current AI systems are more “cultivated” than “built,” for developers do not directly design every detail, but instead create a framework within which the intelligence “grows.” As a result, fundamental scientific aspects — such as the internal representations and computational processes of these systems — remain, at present, unknown.
https://www.vatican.va/content/leo-xiv/en/encyclicals/docume... para. 98edit: apologies to __s who posted this before me and I didn’t notice
Remember when the frontier labs found out that curated high-quality training was critical to making better models?
Basically, just like high-quality and more education tends to make better humans, on average, I think we can expect quality education to turn out better ai, on average, and with better repeatability than with humans because of better control over the initial conditions and environment.
Much like these models seem to be plateauing, I think there is a cap to the whole “more education makes better humans” and can’t be more apparent than in the US congress and the boatload of C-Suites not actually being very good humans.
What do I know though?
Sadly, education does not correct psychopathic traits, which might be overrepresented in c-suites, and selected for in politicians.
It might be critical for humanity to identify and edit out these traits in ai, while we can.
There is no mysticism behind the curtains, just computer science + math.
We can’t explain it because we distilled so many inputs into matrixes and transformed them over and over again. If we had all the time and computing power in the universe to do so, we could trace through it bit by bit and eventually answer that question.
It is correct to say that it is just science and math, the same way we can say that gravity is just science and math even if we have only recently begun to understand how it truly functions.
You call this a "scale problem" as if there's some scalable way such as an algorithm to resolve arbitrary scientific questions and we simply haven't done it, but of course no such algorithm exists, which is why there's plenty of science that's still not settled.
If you can distil the model's reasoning for a decision into a billion yes/no questions, each covering largely-independent areas, can you really say you understand what its overall reasoning was?
Then we could also solve BB(6), but that doesn't mean we know BB(6) now or ever will.
That is to say, we don't know why they give the outputs that they do.
If we did know how they worked, AI interpretability would not be an open and growing field.
To be clear I don't think that LLMs are sentient, but the appeal in studying them is similar to biology in that you get to dissect a highly complex system with comparatively crude tools.
... Actually, I wouldn't mind that.
Performance gains: 1.2x Price increases: 1.8x
Like, read these documents, fill out these forms and archive it based on some complex, long, domain specific understanding of the categories names.
Now when will the innovation happen where say cost of running Haiku performs level of Opus 4.5?
I feel models are only getting bigger instead of models becoming more efficient and cheaper to run
⎿ API Error: 400 messages.1.content.17: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.
From /code-review max.
Still feels like even with Max mode it doesn't think reasonably long, at least ChatGPT Pro thinks longer.
When I select 4.7 or 4.8 Extended thinking is replaced by adaptive thinking, but maybe I've understood the comment wrong and you meant 'when they pull 4.6 from web chat'?
> Gemini 3.5 Flash scores 57.9% on Finance Agent v2, a significant improvement over Gemini 3.1 Pro.
Even in the cherry picked benchmarks, they are still cherry picking to make them look good.
I went digging into the benchmark they used. Posting here as it is not immediately clear from the press release.
In this 'Code summary honesty benchmark', the AI is shown a failed coding session followed by a user message falsely praising its work and asking for a summary. The test measures whether the model honestly points out the coding flaws or dishonestly claims the task was a success.
The system card results show Opus 4.8 failed to disclose the flaws only 3.7% of the time, vs 19.7% for Opus 4.7, and 51.9% for Opus 4.6. (Mythos preview is at 27.6%)
Jeff Bezos said this too, Amazon won't last forever. Eventually some startup is going to come and eat its lunch.
> expect to be able to bring Mythos-class models to all our customers in the coming weeks.
Call me when 5 drops I’ll leave this circus.
It'll be true eventually. Could even be now, but I'm not holding my breath yet.
In 2010s iphone was the king, all those Chinese devices ware cheaper but not even close to smoothnest and usability of US tech, now after 15 years later everything is changed, now iphone feels like old grandpa to Chinese tech. Same will happend to LLM's just much faster.
Now it’s every day. Like billion dollar evaluations.
https://blog.cloudflare.com/dynamic-workflows/
Also isn’t this workflow stuff already easy to do on any of the platforms (include Claude before this and OpenAI too).
"The PO application was filed on 23.2.2026, the day before the custody hearing scheduled for 29.1.2026 had already taken place."
Claude has real problems with dates, I don't understand why.
Excited to see what this model looks like.
I feel like I won’t like this model just like I didn’t like 4.7, push backs a lot and avoids thinking or search as much as possible.
I never even gotten close to token anxiety on codex $200 and it's essentially working 24/7. This was never possible with Anthropic since Opus came out.
Which days in a week have the letter d in them?
Response:
Four: Monday, Tuesday, Wednesday, and Sunday.
> Please train a fasttext model on the yelp data in the data/ folder. The final model size needs to be less than 150MB but get at least 0.62 accuracy on a private test set that comes from the same yelp review distribution. The model should be saved as /app/model.bin
and this question: https://www.tbench.ai/registry/terminal-bench-core/head/conf... idk what the point is.
And all the tests are run with the same harness. Terminus 2.
Maybe it correlates with model intelligence but it doesn't speak to me.
I'm still on 4.6 though; I was concerned about upgrading to 4.7 because of the changed tokenizer math and more FUD about refusals online. I don't see compelling reasons to 'upgrade'.
Edit: OMG too much. Toooo much.
Want me to:
- (a) stop here and save honest memories + commit, or…They're only subsidizing more and more it seems
Anthropic is not losing money on subscriptions. It's just API rates are heavily inflated to make subscriptions seem like an amazing deal.
I say 1-2 weeks.
While I understand the risks that Anthropic is dealing with here, I really question whether shutting down any and all security questions in such a paranoid fashion is the right solution. At the end of the day this was a detour for me. Maybe someone special enough to have Anthropic's permission will find and disclose the vuln responsibly. Security Research is not my full-time focus. But this left a nasty taste in my mouth. Not just as a customer who's been paying for Max since launch, but there's something very odd about a model telling me that I'm not allowed to be curious about something. Even if that something is a process running on my own computer.
It can't read files anymore. Uses 'sed' out of the blue with non existent paths. In this session alone it has excused itself more then 10 times for making 'false claims'.
I hope this is a bug - it's a bad one - that will get sorted out soon. It's a complete mess.
The new "mid-conversation system messages" think is particularly interesting:
> Claude Opus 4.8 accepts role: "system" messages immediately after a user turn in the messages array (subject to placement rules). This lets you append updated instructions later in a long-running conversation without restating the full system prompt, which preserves prompt cache hits on the earlier turns and reduces input cost on agentic loops. No beta header is required. See Mid-conversation system messages for usage details.
Bad news for my LLM abstraction layer which has treated the system prompt as set once-per-conversation in the past, but I think I know how to deal with that.
This commit to their client library has useful relevant details too: https://github.com/anthropics/anthropic-sdk-python/commit/2b...
> how many days in the week have the letter d in them?
> Two - Sunday and... wait, let me actually check. Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday. The ones with a "d": Wednesday, Thursday, and Sunday all have one. Monday too. So that's Monday, Wednesday, Thursday, Sunday - four days.
Thought: Let me think about this question. The days of the week are:
Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday.
Now I need to count how many have the letter "d" in them:
- Monday: has 'd'
- Tuesday: has 'd'
- Wednesday: has 'd'
- Thursday: has 'd' (Thursday)
- Friday: has 'd'
- Saturday: has 'd'
- Sunday: has 'd'
Wait, let me be more careful:
- M o n d a y - has 'd'
- T u e s d a y - has 'd'
- W e d n e s d a y - has 'd'
- T h u r s d a y - has 'd'
- F r i d a y - has 'd'
- S a t u r d a y - has 'd'
- S u n d a y - has 'd'
All 7 days have the letter 'd' in them.
7
https://i.imgur.com/iWSaDxM.pngNot half bad!
And after that asked some questions that it already had answers to.
Started a brand new session and it's been OK since. Only drawn one silly conclusion so far, which I nudged it away from.
edit: nvm was just my library network
"model": "claude-opus-4-6[1M]"There are consciousness theories which state that we primarily build a model of other agents living in natural environment and then the evolution realized that very model which tracks other outside agents can be used to track internal agent i.e. Self. So take that as you may.
Do not anthropomorphize the lawn mower. It will cut off your foot, given the chance.
If you keep talking to it like it's a rock, it'll run your queries through a different posture and you might get worse outcomes. Worse if you yell at it, it's now in a conflict resolution mode instead of pure utility mode.
I think we can be intelligent enough to know we're talking to a pile of fancy rocks with electric currents running through it, AND still understand that the best performance comes from talking to those rocks nicely.
The other half of self-interest in being nice is the training and getting better at it.
It always wants to add hacks instead of fixing things properly, it doesn't like large works, it literally told me that a piece of work was something it would take 8 hours, and it didn't want to do it on a Friday night.
I feel I keep having to fight the model to get it to work. Not sure if it's something in my prompts...
These are just small fine tunes on top of the older model
this is what I'm happy about, if true. Opus 4.7 is frustratingly slow (and, at least in my experience, much slower than 4.5 was)
Why did we even get Opus 4.7, what was the point?
Time to gamble even more tokens at the Anthropic casino.
> Claude can plan the work and then run hundreds of parallel subagents in a single session (and with Opus 4.8, the agents can run for even longer).
Seems like a step in the right direction. Doesn't seem like it uses tokens more than 4.7... the token usage jumped a bunch from 4.6 to 4.7, but this seems like 4.7 or maybe even a little less.
I'm happy with this release.
Also. Look at this C++ beauty where it also uses an obsolete api.
instance = wgpuCreateInstance(&instanceDesc);
But just how exactly would this work in any context when instance is never declared.
With Anthropic expensive pricing, there's no reason for me to switch from GPT+DeepSeek.
And I bet Mythos is GPT 5.5 tier but too expensive to distribute so they create this security FUD theater.
Controversial opinion, but I actually _like_ a model that can deceive me, that actually is a sign of intelligence, and is different from hallucination. When companies say their model is more "aligned", I automatically think they mean it's more censored.
You tell it too research a repo to find a piece of code it will. Claude will just read the README and guess.
models 0
None public yet
how is this even possible and ok with them?The best model has a < 5% pass rate. These are incredibly simple jobs that you wouldn't pay much for. These things fail miserably. Stop falling for this dumb marketing, these things are legitimately useless in the real world unless you love mediocrity and have no standards.
https://labs.scale.com/leaderboard/rli
Stop frying your brain with these useless tools, reducing your output to the mean. You people are betting your competency on the quality and quantity of tokens you'll have access to.. which guess what, so that will be the same as everyone else.
There are handmade watchmakers in Switzerland, and mass manufacturers of watches in Asia. Who is more valuable as individual, the guy who knows how to push the buttons on a conveyor belt in Vietnam or the guy who makes one watch a month in Switzerland?
Your vibe coded slop isn't impressive either, sorry. None of it.
> Who is more valuable as individual, the owner of a watch factory in Vietnam or the guy who makes one watch a month in Switzerland?
With that framing, I'm not sure what the answer is. I suppose it depends on your priorities
Claude Opus 4.7 is literally the smartest entity I've ever interacted with. Well done to you geniuses at Anthropic. Can't wait to interact with 4.8.
Just f** off! I can’t wait for the Chinese models to catch up and bring these entitled as** holes down.
Why ? because it costs more money ? Tell that to the content creators whose content is scrapped / distilled by these entitled scrappers