[-]

Haven't seen a jump this large since I don't even know, years? Too bad they are not releasing it anytime soon (there is no need as they are still currently the leader).

by ru5523 hours ago|

parent|

[-]

There's speculation that next Tuesday will be a big day for OpenAI and possibly GPT 6. Anthropic showed their hand today.

by varispeed31 minutes ago|

parent|

[-]

Sounds like a good opportunity to pause spending on nerfed 4.6 and wait for the new model to be released and then max out over 2 weeks before it gets nerfed again.

by enraged_camel3 hours ago|

parent|

prev|

[-]

That does not sound very believable. Last time Anthropic released a flagship model, it was followed by GPT Codex literally that afternoon.

by cyanydeez2 hours ago|

parent|

[-]

Y'all know they're teaching to the test. I'll wait till someone devises a novel test that isn't contained in the datasets. Sure, they're still powerful.

by swalsh1 hours ago|

parent|

prev|

[-]

My understanding is GPT 6 works via synaptic space reasoning... which I find terrifying. I hope if true, OpenAI does some safety testing on that, beyond what they normally do.

by coppsilgold19 minutes ago|

parent|

[-]

Likely an improvement on:

> We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.

<https://arxiv.org/abs/2502.05171>
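
For anyone wanting intuition, the "iterate a recurrent block" idea from the abstract can be sketched in a few lines of Python. This is a toy under loud assumptions, not the paper's architecture and certainly not anything confirmed about GPT 6: `recurrent_block`, `latent_reason`, the `tanh` update, and the `weight` value are all made up for illustration. The only point is that one shared block is applied a variable number of times at test time, and no tokens are emitted between iterations.

```python
import math


def recurrent_block(state, embedded_input, weight=0.5):
    """One shared block (hypothetical): mix the latent state with the fixed input embedding."""
    return [math.tanh(weight * s + x) for s, x in zip(state, embedded_input)]


def latent_reason(embedded_input, num_iterations):
    """Apply the same block num_iterations times; nothing is decoded in between."""
    state = [0.0] * len(embedded_input)
    for _ in range(num_iterations):
        state = recurrent_block(state, embedded_input)
    return state


# Same input, different test-time depth (both values are arbitrary choices).
shallow = latent_reason([0.1, -0.3, 0.7], num_iterations=1)
deep = latent_reason([0.1, -0.3, 0.7], num_iterations=32)
```

Raising `num_iterations` spends more compute on the same input without growing the context, which is the contrast with chain-of-thought that the abstract draws.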

by tyre45 minutes ago|

parent|

prev|

[-]

From the recent New Yorker piece on Sam:

“My vibes don’t match a lot of the traditional A.I.-safety stuff,” Altman said. He insisted that he continued to prioritize these matters, but when pressed for specifics he was vague: “We still will run safety projects, or at least safety-adjacent projects.” When we asked to interview researchers at the company who were working on existential safety—the kinds of issues that could mean, as Altman once put it, “lights-out for all of us”—an OpenAI representative seemed confused. “What do you mean by ‘existential safety’?” he replied. “That’s not, like, a thing.”

by levocardia1 hours ago|

parent|

prev|

[-]

Oh you mean literally the thing in AI2027 that gets everyone killed? Wonderful.

by notrealyme1231 hours ago|

parent|

prev|

[-]

That sounds really interesting. Do you have some hints where to read more?

by arm321 hours ago|

parent|

prev|

[-]

Oh, of course they will /s

by lumost10 minutes ago|

parent|

prev|

[-]

Is this even real? Coming off the heels of GLM5.1's announcement, this feels almost like a Llama 4 launch to head off competition.

by Jcampuzano23 hours ago|

parent|

prev|

[-]

A jump that we will never be able to use, since we're not part of the 100-billion-dollar-company club that seems to be the minimum requirement for access.

I get the security aspect, but if we've hit that point any reasonably sophisticated model past this point will be able to do the damage they claim it can do. They might as well be telling us they're closing up shop for consumer models.

They should just say they'll never release a model of this caliber to the public at this point and say out loud we'll only get gimped versions.

by cedws2 hours ago|

parent|

[-]

More than killer AI I'm afraid of Anthropic/OpenAI going into full rent-seeking mode so that everyone working in tech is forced to fork out loads of money just to stay competitive on the market. These companies can also choose to give exclusive access to hand picked individuals and cut everyone else off and there would be nothing to stop them.

This is already happening to some degree: GPT 5.3 Codex's security capabilities were given exclusively to those who were approved for a "Trusted Access" programme.

by TypesWillSaveUs2 hours ago|

parent|

[-]

Describing providing a highly valuable service for money as `rent seeking` is pretty wild.

by bertil1 hours ago|

parent|

[-]

It could be, formally, if they have a monopoly.

However, I’m tempted to compare to GitHub: if I join a new company, I will ask to be added to their GitHub account without hesitation. I couldn’t possibly imagine they wouldn’t have one. What keeps the cost of that subscription reasonable is not just GitHub’s fear of a crowd with pitchforks showing up at their office, but also the fact that a possible answer to my non-question might be “Oh, we actually use GitLab.”

If Anthropic is as good as they say, it seems fairly doable to use the service to build something comparable: poach a few disgruntled employees, leverage the promise to undercut a many-trillion-dollar company to be a many-billion dollar company to get investors excited.

I’m sure the founders of Anthropic will have more money than they could possibly spend in ten lifetimes, but I can’t imagine there wouldn’t be some competition. Maybe this time it’s different, but I can’t see how.

by johnsimer1 hours ago|

parent|

[-]

> It could be, formally, if they have a monopoly.

You have 2 labs at the forefront (Anthropic/OpenAI), Google close behind, and xAI/Meta/half a dozen Chinese companies all within 6-12 months. There is plenty of competition, and the price of equally intelligent tokens drops rapidly whenever a new intelligence level is achieved.

Unless the leading company uses a model to nefariously take over or neutralize another company, I don't really see a monopoly happening in the next 3 years.

by bertil14 minutes ago|

parent|

[-]

Precisely.

I was focusing on a theoretical dynamic analysis of competition (Would a monopoly make having a competitor easier or harder?) but you are right: practically, there are many players, and they are too diverse in their values and interests to allow collusion.

We could be wrong: each of those could give birth to as many Basilisks (not sure I have a better name for those conscious, invisible, omni-present, self-serving monsters that so many people imagine will emerge) that coordinate and maintain collusion somehow, but classic economics (complementarity, competition, etc.) points at disruption and lowering costs.

by 1attice2 hours ago|

parent|

prev|

[-]

My housing is pretty valuable. I pay rent. Which timeline are you in?

by bonsai_spool1 hours ago|

parent|

[-]

Actually you're saying similar things:

Rent-seeking of old was a ground rent, monies paid for the land without considering the building that was on it.

Residential rents today often carry implied warranties because of modern law, so your landlord is essentially selling you a service at a particular location.

by 1attice16 minutes ago|

parent|

[-]

thanks!

by kaashif1 hours ago|

parent|

prev|

[-]

Rent seeking refers to https://en.wikipedia.org/wiki/Rent-seeking

by 1attice15 minutes ago|

parent|

[-]

Yes I know that, read your sibling post

by mhluongo1 hours ago|

parent|

prev|

[-]

Two different "rent"s.

by 1attice16 minutes ago|

parent|

[-]

Not really; see your sibling post.

by robwwilliams4 minutes ago|

parent|

prev|

[-]

With Gemma-4 open and running on laptops and phones I see the flip side. How many non-HN users or researchers even need Opus 4.6e level performance? OpenAI, Anthropic and Google may be “rent seeking” from large corporations — like the Oracles and IBMs.

by aspenmartin2 hours ago|

parent|

prev|

[-]

Well, don’t forget we still have competition. Were Anthropic to rent seek, OpenAI would undercut them. Were OpenAI and Anthropic to collude, that would be illegal. And even if Anthropic captured the entire coding agent market and THEN tried to rent seek, it has never been easier to raise $1B and start a competing lab.

by cedws2 hours ago|

parent|

[-]

In practice this doesn't work, though. The Mastercard-Visa duopoly is an example: two competing forces don't create aggressive enough competition to benefit the consumer. The only hope we have is the Chinese models, but it will always be too expensive to run the full models for yourself.

by brokencode2 hours ago|

parent|

[-]

New companies can enter this space. Google’s competing, though behind. Maybe Microsoft, Meta, Amazon, or Apple will come out with top notch models at some point.

There is no real barrier to a customer of Anthropic adopting a competing model in the future. All it takes is a big tech company deciding it’s worth it to train one.

On the other hand, Visa/Mastercard have a lot of lock-in due to consumers only wanting to get a card that’s accepted everywhere, and merchants not bothering to support a new type of card that no consumer has. There’s a major chicken and egg problem to overcome there.

by sghiassy2 hours ago|

parent|

prev|

[-]

Chinese competition can always be banned. Example: Chinese electric car competition

by sho_hn2 hours ago|

parent|

[-]

That's what OP was saying, I think, noting that running them locally won't be a solution.

by oblio1 hours ago|

parent|

prev|

[-]

Also Chinese smartphones. Huawei was about 12-18 months from becoming the biggest smartphone manufacturer in the world a few years ago. If it had been allowed to sell its phones freely in the US, I'm fairly sure Apple would have been closer to Nokia than to current-day Apple.

by aurareturn1 hours ago|

parent|

[-]

If Huawei had never been banned from using TSMC, they'd likely have a real Nvidia competitor and may have surpassed Apple in mobile chip designs.

They actually beat Apple A series to become the first phone to use the TSMC N7 node.

by therealdeal20201 hours ago|

parent|

prev|

[-]

but you are assuming that the magical wizards are the only ones who can create powerful AIs... mind you, these people were born just a few decades ago. Their knowledge will be transferred and it will only take a few more decades until anyone can train powerful AIs ... you can only sit on tech for so long before everyone knows how to do it

by cedws1 hours ago|

parent|

[-]

It's not a matter of knowledge, it's a matter of resources. It takes billions of dollars of hardware to train a SOTA LLM and it's increasing all the time. You cannot possibly hope to compete as an independent or small startup.

by block_dagger30 minutes ago|

parent|

[-]

Presumably, the hardware to run this level of model will be democratized within the timeframe of the parent comment.

by MattRix1 hours ago|

parent|

prev|

[-]

The thing is that the current models can ALREADY replicate most software-based products and services on the market. The open source models are not far behind. At a certain point I'm not sure it matters if the frontier models can do faster and better. I see how they're useful for really complex and cutting edge use cases, but that's not what most people are using them for.

by quotemstr2 hours ago|

parent|

prev|

[-]

This is why the EAs, and their almost comic-book-villain projects like "control AI dot com" cannot be allowed to win. One private company gatekeeping access to revolutionary technology is riskier than any consequence of the technology itself.

by scrawl1 hours ago|

parent|

[-]

Having done a quick search of "control AI dot com", it seems their intent is to educate lawmakers & government in order to aid development of a strong regulatory framework around frontier AI development.

Not sure how this is consistent with "One private company gatekeeping access to revolutionary technology"?

by quotemstr1 hours ago|

parent|

[-]

> strong regulatory framework around frontier AI development

You have to decode feel-good words into the concrete policy. The EAs believe that the state should prohibit entities not aligned with their philosophy from developing AIs beyond a certain power level.

by frozenseven2 hours ago|

parent|

prev|

[-]

Couldn't agree more. The "safest" AI company is actually the biggest liability. I hope other companies make a move soon.

by FeepingCreature2 hours ago|

parent|

prev|

[-]

No it isn't lol. The consequence of the technology literally includes human extinction. I prefer 0 companies, but I'll take 1 over 5.

by guzfip2 hours ago|

parent|

prev|

[-]

> A jump that we will never be able to use since we're not part of the seemingly minimum 100 billion dollar company club as requirement to be allowed to use it.

> They should just say they'll never release a model of this caliber to the public at this point and say out loud we'll only get gimped

Duh, this was fucking obvious from the start. The only people saying otherwise were zealots who needed a quick line to dismiss legitimate concerns.

by WarmWash2 hours ago|

prev|

[-]

Are these fair comparisons? It seems like mythos is going to be like a 5.4 ultra or Gemini Deepthink tier model, where access is limited and token usage per query is totally off the charts.

by mulmboy2 hours ago|

parent|

[-]

There are a few hints in the doc around this

> Importantly, we find that when used in an interactive, synchronous, “hands-on-keyboard” pattern, the benefits of the model were less clear. When used in this fashion, some users perceived Mythos Preview as too slow and did not realize as much value. Autonomous, long-running agent harnesses better elicited the model’s coding capabilities. (p201)

^^ From the surrounding context, this could just be because the model tends to do a lot of work in the background which naturally takes time.

> Terminal-Bench 2.0 timeouts get quite restrictive at times, especially with thinking models, which risks hiding real capabilities jumps behind seemingly uncorrelated confounders like sampling speed. Moreover, some Terminal-Bench 2.0 tasks have ambiguities and limited resource specs that don’t properly allow agents to explore the full solution space — both being currently addressed by the maintainers in the 2.1 update. To exclusively measure agentic coding capabilities net of the confounders, we also ran Terminal-Bench with the latest 2.1 fixes available on GitHub, while increasing the timeout limits to 4 hours (roughly four times the 2.0 baseline). This brought the mean reward to 92.1%. (p188)

> ...Mythos Preview represents only a modest accuracy improvement over our best Claude Opus 4.6 score (86.9% vs. 83.7%). However, the model achieves this score with a considerably smaller token footprint: the best Mythos Preview result uses 4.9× fewer tokens per task than Opus 4.6 (226k vs. 1.11M tokens per task). (p191)
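
The arithmetic in that last quote is easy to check. A quick sketch, taking "226k" and "1.11M" at face value as 226,000 and 1,110,000 tokens (the variable names are mine, not from the system card):

```python
# Figures copied from the quoted passage (p191); rounding in the source is assumed.
opus_tokens_per_task = 1_110_000    # "1.11M"
mythos_tokens_per_task = 226_000    # "226k"

token_ratio = opus_tokens_per_task / mythos_tokens_per_task
accuracy_gain_points = round(86.9 - 83.7, 1)

print(f"{token_ratio:.1f}x fewer tokens per task")
print(f"{accuracy_gain_points} points of accuracy gained")
```

So the "4.9×" claim is consistent with the raw token counts, and the accuracy delta is about 3.2 points.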

by alyxya1 hours ago|

parent|

[-]

The first point is along the lines of what I'd expect given that claude code is generally reliable at this point. A model's raw intelligence doesn't seem as important right now compared to being able to support arbitrary length context.

by ninjagoo1 hours ago|

prev|

[-]

> Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)

> Terminal-Bench 2.0: 82.0% / 65.4% / 75.1% / 68.5%

> GPQA Diamond: 94.5% / 91.3% / 92.8% / 94.3%

> MMMLU: 92.7% / 91.1% / — / 92.6–93.6%

> USAMO: 97.6% / 42.3% / 95.2% / 74.4%

> OSWorld: 79.6% / 72.7% / 75.0% / —

Given that for a number of these benchmarks, it seems to be barely competitive with the previous gen Opus 4.6 or GPT-5.4, I don't know what to make of the significant jumps on other benchmarks within these same categories. Training to the test? Better training?

And the decision to withhold general release (of a 'preview' no less!) seems to be well, odd. And the decision to release a 'preview' version to specific companies? You know any production teams at these massive companies that would work with a 'preview' anything? R&D teams, sure, but production? Part of me wants to LoL.

What are they trying to do? Induce FOMO and stop subscriber bleed-out stemming from the recent negative headlines around problems with using Claude?

by TacticalCoder54 minutes ago|

parent|

[-]

> Given that for a number of these benchmarks, it seems to be barely competitive with the previous gen

We're not reading the same numbers I think. Compared to Opus 4.6, it's a big jump nearly in every single bench GP posted. They're "only" catching up to Google's Gemini on GPQA and MMMLU but they're still beating their own Opus 4.6 results on these two.

This sounds like a much better model than Opus 4.6.

by ninjagoo41 minutes ago|

parent|

[-]

> We're not reading the same numbers I think.

We must not be.

That's why I listed out the ones where it is barely competitive from @babelfish's table, which itself is extracted from Pg 186 & 187 of the System Card, which has the comparison with Opus 4.6, GPT 5.4 and Gemini 3.1 Pro.

Sure, it may be better than Opus 4.6 on some of those, but barely achieves a small increase over GPT-5.4 on the ones I called out.

by nimchimpsky19 minutes ago|

parent|

[-]

[dead]

by pants23 hours ago|

prev|

[-]

We're gonna need some new benchmarks...

ARC-AGI-3 might be the only remaining benchmark below 50%

by Leynos2 hours ago|

parent|

[-]

Opus 4.6 currently leads the remote labor index at 4.17. GPT-5.4 isn't measured on that one though: https://www.remotelabor.ai/

GPT 5.4 Pro leads Frontier Maths Tier 4 at 35%: https://epoch.ai/benchmarks/frontiermath-tier-4/

by randomtoast2 hours ago|

parent|

prev|

[-]

[dead]

by AlexC041 hours ago|

prev|

[-]

but how does it perform on pelican riding a bicycle bench? why are they hiding the truth?!

(edit: I hope this is an obvious joke. less facetiously these are pretty jaw dropping numbers)

by bertil1 hours ago|

parent|

[-]

We are all fans of Simon’s work, and his test is, strangely enough, quite good.

by johnnichev22 minutes ago|

prev|

[-]

damn... ok that's impressive.

by whalesalad3 hours ago|

prev|

[-]

Honestly we are all sleeping on GPT-5.4. Particularly with the recent influx of Claude users (and the increasingly unstable platform), Codex has been added to my rotation and it's surprising me.

by babelfish3 hours ago|

parent|

[-]

Totally. Best-in-class for SWE work (until Mythos gets released, if ever, but I suspect the rumored "Spud" will be out by then too)

by girvo1 hours ago|

parent|

[-]

It really isn’t. I wish it was, because work complains about overuse of Opus.

by rafaelmn3 hours ago|

parent|

prev|

[-]

GPT is shit at writing code. It's not dumb - extra high thinking is really good at catching stuff - but it's like letting a smart junior into your codebase: it ignores all the conventions and surrounding context, and just slops all over the place to get things working. Claude is just a level above in terms of editing code.

by sho_hn2 hours ago|

parent|

[-]

Very different experience for me. Codex 5.3+ on xhigh are the only models I've tried so far that write reasonably decent C++ (domains: desktop GUI, robotics, game engine dev, embedded stuff, general systems engineering-type codebases), and idiomatic code in languages not well-represented in training data, e.g. QML. One thing I explicitly like is that it knows better when to stop, instead of brute-forcing a solution by spamming bespoke helpers everywhere in a way no rational dev would write.

Not always, no, and it takes investment in good prompting/guardrails/plans/explicit test recipes for sure. I'm still on average better at programming in context than Codex 5.4, even if slower. But in terms of "task complexity I can entrust to a model and not be completely disappointed and annoyed", it scores the best so far. Saves a lot on review/iteration overhead.

It's annoying, too, because I don't much like OpenAI as a company.

(Background: 25 years of C++ etc.)

by boring-human1 hours ago|

parent|

[-]

Same background as you, and same exact experience as you. Opus and Gemini have not come close to Codex for C++ work. I also run exclusively on xhigh. Its handling of complexity is unmatched.

At least until next week when Mythos and GPT 6 throw it all up in the air again.

by Jcampuzano23 hours ago|

parent|

prev|

[-]

Not my experience. GPT 5.4 walks all over Claude from what I've worked with, and it's Claude that is the one willing to just go do unnecessary stuff that was never asked for or implement the more hacky solutions to things without a care for maintainability/readability.

But I do not use extra high thinking unless it's for code review. I sit at GPT 5.4 high 95% of the time.

by camdenreslink10 minutes ago|

parent|

prev|

[-]

ChatGPT 5.4 with extra high reasoning has worked really well for me, and I don't notice a huge difference with Opus 4.6 with high reasoning (those are the 2 models/thinking modes I've used the most in the last month or so).

by leobuskin3 hours ago|

parent|

prev|

[-]

And as a bonus: GPT is slow. I’m doing a lot of RE (IDA Pro + MCP), and even when 5.4 gives slightly better guesses (rarely, but it happens), it takes 2x-4x longer. So it’s just easier to iterate with Opus.

by blazespin1 hours ago|

parent|

[-]

Yeah, need some good RE benchmarks for the LLMs. :)

RE is a very interesting problem. A lot more than SWE can be RE'd. I've found the LLMs are reluctant to assist, though you can work around that.

by porker1 hours ago|

parent|

[-]

What is RE in this context?

by astrange59 minutes ago|

parent|

[-]

Reverse engineering

by zarzavat3 hours ago|

parent|

prev|

[-]

Yes, it's becoming clear that OpenAI kinda sucks at alignment. GPT-5 can pass all the benchmarks but it just doesn't "feel good" like Claude or Gemini.

by lilytweed2 hours ago|

parent|

[-]

Whenever I come back to ChatGPT after using Claude or Gemini for an extended period, I’m really struck by the “AI-ness.” All the verbal tics and, truly, sloppishness, have been trained away by the other, more human-feeling models at this point.

by kranke15517 minutes ago|

parent|

[-]

GPT was clearly changed after its sycophantic models led to the lawsuits.

by chaos_emergent2 hours ago|

parent|

prev|

[-]

An alternative but similar formulation of that statement is that Anthropic has spent more training effort in getting the model to “feel good” rather than being correct on verifiable tasks. Which more or less tracks with my experience of using the model.

by whalesalad3 hours ago|

parent|

prev|

[-]

This has been my experience. With very very rigid constraints it does ok, but without them it will optimize for expediency and getting it done at the expense of integrating with the broader system.

by ctoth2 hours ago|

parent|

[-]

My favorite example of this from last night:

Me: Let's figure out how to clone our company Wordpress theme in Hugo. Here're some tools you can use, here's a way to compare screenshots, iterate until 0% difference.

Codex: Okay Boss! I did the thing! I couldn't get the CSS to match so I just took PNGs of the original site and put them in place! Matches 100%!

by simianwords3 hours ago|

prev|

[-]

The real part is SWE-bench Verified since there is no way to overfit. That's the only one we can believe.

by ollin2 hours ago|

parent|

[-]

My impression was entirely the opposite; the unsolved subset of SWE-bench Verified problems are memorizable (solutions are pulled from public GitHub repos) and the evaluators are often so brittle or disconnected from the problem statement that the only way to pass is to regurgitate a memorized solution.

OpenAI had a whole post about this, where they recommended switching to SWE-bench Pro as a better (but still imperfect) benchmark:

https://openai.com/index/why-we-no-longer-evaluate-swe-bench...

> We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions

> SWE-bench problems are sourced from open-source repositories many model providers use for training purposes. In our analysis we found that all frontier models we tested were able to reproduce the original, human-written bug fix

> improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time

> We’re building new, uncontaminated evaluations to better track coding capabilities, and we think this is an important area to focus on for the wider research community. Until we have those, OpenAI recommends reporting results for SWE-bench Pro.
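
To put the audit numbers in proportion, here is a rough back-of-the-envelope sketch. The two percentages come from the quote above; the dataset size of 500 problems is my assumption about SWE-bench Verified and is not stated in the post, and the variable names are just for illustration.

```python
# Shares quoted from the OpenAI post.
audited_share = 0.276             # fraction of the dataset that was audited
flawed_share_of_audited = 0.594   # "at least" this fraction of audited problems is flawed

# Lower bound on the flawed fraction of the whole benchmark
# (it says nothing about the unaudited remainder).
flawed_lower_bound = audited_share * flawed_share_of_audited
print(f"at least {flawed_lower_bound:.1%} of all problems")

# ASSUMPTION: 500 problems in SWE-bench Verified (not given in the thread).
assumed_dataset_size = 500
audited_problems = round(audited_share * assumed_dataset_size)
flawed_problems = round(flawed_share_of_audited * audited_problems)
print(audited_problems, flawed_problems)
```

Under that assumption, roughly one in six problems in the full set has a test that rejects correct fixes, and that is only counting the part that was audited.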

by simianwords2 hours ago|

parent|

[-]

I stand corrected.