
That last benchmark seemed like an impressive leg up against Opus until I saw the sneaky footnote that it was actually a Sonnet result. Why even include it then, other than hoping people don't notice?

by osti, 5 hours ago

It's only that one number that is for sonnet.

by 0123456789ABCDE, 4 hours ago

except for the webarena-verified

by conradkay, 5 hours ago

Sonnet was pretty close to (or better than) Opus in a lot of benchmarks, I don't think it's a big deal

by jitl, 5 hours ago

wat

by 0123456789ABCDE, 4 hours ago

maybe gp's use of the word "lots" is unwarranted

https://artificialanalysis.ai indicates that Sonnet 4.6 beats Opus 4.6 on GDPval-AA, Terminal-Bench Hard, AA Long Context Reasoning, and IFBench.

see: https://artificialanalysis.ai/?models=claude-sonnet-4-6%2Ccl...

by chabes, 6 hours ago

Definitely don’t want to click in at x either.

by thejarren, 6 hours ago

Solution https://xcancel.com/OpenAI/status/2029620619743219811?s=20

by Sabinus, 3 hours ago

Get a redirect plugin and set it up to send you to xcancel instead of Twitter. I've done it, and it's very convenient.
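For anyone who would rather script this than install a plugin, here is a minimal sketch of the rewrite such a rule performs. It assumes xcancel mirrors the original path and query string, which matches the link posted above; the host list and the function name `to_xcancel` are illustrative assumptions, not part of any particular plugin.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of hosts to redirect; adjust to taste.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com",
                 "x.com", "www.x.com"}


def to_xcancel(url: str) -> str:
    """Rewrite a Twitter/X URL to the same path on xcancel.com.

    URLs on any other host are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.hostname not in TWITTER_HOSTS:
        return url
    # Keep path, query and fragment; only the host changes.
    return urlunsplit((parts.scheme or "https", "xcancel.com",
                       parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(to_xcancel("https://x.com/OpenAI/status/2029620619743219811?s=20"))
```

A browser redirect extension would express the same thing as a pattern rule; the function above only shows the mapping.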

by anonym00se1, 6 hours ago

Ditto, but I did anyways and enjoyed that OpenAI doesn't include the dogwater that is Grok on their scorecard.

by observationist, 6 hours ago

[flagged]

by Aboutplants, 6 hours ago

It seems that all frontier models are roughly even at this point. One may be slightly better for certain things, but in general I think we are approaching a real level playing field in terms of ability.

by observationist, 6 hours ago

Benchmarks don't capture a lot - relative response times, vibes, which unmeasured capabilities are jagged and which are smooth, etc. I find there's a lot of difference between models - there are things Grok does better than ChatGPT even though the benchmarks say the opposite, and vice versa. There's also the UI and tools at hand - ChatGPT image gen is just straight up better, but Grok Imagine does better videos, and is faster.

Gemini and Claude also have their strengths. Apparently Claude handles real-world software better, but with the extended context and improvements to Codex, ChatGPT might end up taking the lead there as well.

I don't think the linear scoring on some of the things being measured is quite applicable in the ways that they're being used, either - a 1% increase for a given benchmark could mean a 50% capabilities jump relative to a human skill level. If this rate of progress is steady, though, this year is gonna be crazy.

by baq, 6 hours ago

Gemini 3.1 slaps all other models at subtle concurrency bugs, sql and js security hardening when reviewing. (Obviously haven’t tested gpt 5.4 yet.)

It’s a required step for me at this point to run any and all backend changes through Gemini 3.1 pro.

by observationist, 5 hours ago

I have a few standard problems I throw at AI to see if they can solve them cleanly, like visualizing a neural network, then sorting each neuron in each layer by synaptic weights, largest to smallest, correctly reordering any previous and subsequent connected neurons such that the network function remains exactly the same. You should end up with the last layer ordered largest to smallest, and prior layers shuffled accordingly, and I still haven't had a model one-shot it. I spent an hour poking and prodding codex a few weeks back and got it done, but it conceptually seems like it should be a one-shot problem.
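The core of the reordering step (leaving the visualization aside) can be sketched in a few lines. This is only an illustration under stated assumptions: the network is a plain fully connected MLP stored as a list of `(weights, biases)` pairs, "sorting by synaptic weights" is taken to mean sorting each hidden neuron by the sum of its incoming weights, and the output layer is left alone because reordering it would reorder the outputs themselves. All function names here are hypothetical.

```python
import math
import random


def forward(layers, x):
    """Run an MLP given as [(W, b), ...]; tanh on every layer but the last."""
    for i, (W, b) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + bias
             for row, bias in zip(W, b)]
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x


def sort_hidden_neurons(layers):
    """Return a copy whose hidden neurons are sorted by summed incoming
    weight (largest first), with the next layer's columns permuted to match."""
    layers = [([row[:] for row in W], b[:]) for W, b in layers]
    for i in range(len(layers) - 1):  # skip the output layer
        W, b = layers[i]
        order = sorted(range(len(W)), key=lambda j: sum(W[j]), reverse=True)
        # One row per neuron: reorder rows and biases together.
        layers[i] = ([W[j] for j in order], [b[j] for j in order])
        # Reorder the next layer's columns so every weight still
        # multiplies the activation of the same neuron as before.
        W_next, b_next = layers[i + 1]
        layers[i + 1] = ([[row[j] for j in order] for row in W_next], b_next)
    return layers


def random_mlp(sizes, seed=0):
    """Build a random MLP with the given layer sizes (illustrative helper)."""
    rng = random.Random(seed)
    return [([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
             [rng.uniform(-1, 1) for _ in range(n_out)])
            for n_in, n_out in zip(sizes, sizes[1:])]


if __name__ == "__main__":
    net = random_mlp([3, 5, 4, 2])
    x = [0.3, -0.7, 1.1]
    print(forward(net, x))
    print(forward(sort_hidden_neurons(net), x))
```

The two printed outputs should agree up to floating-point rounding, since permuting a hidden layer's rows and the following layer's columns by the same order only changes the order in which terms are summed.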

by thejohnconway, 1 hour ago

Lol, I’ve had cutting-edge models suggest I make an inflexible hole bigger by putting a shim in it, and argue their case stubbornly. I don’t know what you’re using that suggests they are anywhere near solving your problem there!

by adonese, 5 hours ago

Which subscription do you use to access it? Via Google AI Pro and Gemini CLI I always get timeouts due to the model being under heavy usage. The chat interface is there and I do have 3.1 Pro as well, but I'm wondering if chat is the only way of accessing it.

by baq, 5 hours ago

Cursor sub from $DAYJOB.

by basch, 4 hours ago

>ChatGPT image gen is just straight up better

Yet so much slower than Gemini / Nano Banana that it's almost unusable for anything iterative.

by bigyabai, 6 hours ago

> If this rate of progress is steady, though, this year is gonna be crazy.

Do you want to make any concrete predictions of what we'll see at this pace? It feels like we're reaching the end of the S-curve, at least to me.

by observationist, 6 hours ago

If you look at the difference in quality between gpt-2 and 3, it feels like a big step, but the difference between 5.2 and 5.4 is more massive, it's just that they're both similarly capable and competent. I don't think it's an S curve; we're not plateauing. Million token context windows and cached prompts are a huge space for hacking on model behaviors and customization, without finetuning. Research is proceeding at light speed, and we might see the first continual/online learning models in the near future. That could definitively push models past the point of human level generality, but at the very least will help us discover what the next missing piece is for AGI.

by ryandrake, 5 hours ago

For 2026, I am really interested in seeing whether local models can remain where they are: ~1 year behind the state of the art, to the point where a reasonably quantized November 2026 local model running on a consumer GPU actually performs like Opus 4.5.

I am betting that the days of these AI companies losing money on inference are numbered, and we're going to be much more dependent on local capabilities sooner rather than later. I predict that the equivalent of Claude Max 20x will cost $2000/mo in March of 2027.

by mootothemax, 4 hours ago

Huh, that’s interesting, I’ve been having very similar thoughts lately about what the near-ish term of this tech looks like.

My biggest worry is that the private jet class of people end up with absurdly powerful AI at their fingertips, while the rest of us are left with our BigMac McAIs.

by thewebguyd, 6 hours ago

Kind of reinforces that a model is not a moat. Products, not models, are what's going to determine who gets to stay in business or not.

by gregpred, 6 hours ago

Memory (model usage over time) is the moat.

by energy123, 6 hours ago

Narrative violation: revenue run rates are increasing exponentially with about 50% gross margins.

by kseniamorph, 5 hours ago

makes sense, but i'd separate two things: models converging in ability vs hitting a fundamental ceiling. what we're probably seeing is the current training recipe plateauing — bigger model, more tokens, same optimizer. that would explain the convergence. but that's not necessarily the architecture being maxed out. would be interesting to see what happens when genuinely new approaches get to frontier scale.

by druskacik, 6 hours ago

That has been true for some time now, definitely since the Claude 3 release two years ago.

by swingboy, 6 hours ago

Why do so many people in the comments want 4o so bad?

by cheema33, 5 hours ago

> Why do so many people in the comments want 4o so bad?

You can ask 4o to tell you "I love you" and it will comply. Some people really really want/need that. Later models don't go along with those requests and ask you to focus on human connections.

by astrange, 6 hours ago

They have AI psychosis and think it's their boyfriend.

The 5.x series have terrible writing styles, which is one way to cut down on sycophancy.

by baq, 6 hours ago

Somebody on Twitter used Claude Code to connect… toys… as MCPs to Claude chat.

We’ve seen nothing yet.

by mikkupikku, 6 hours ago

My computer ethics teacher was obsessed with 'teledildonics' 30 years ago. There's nothing new under the sun.

by Sharlin, 4 hours ago

There are many games these days that support controllable sex toys. There's an interface for that, of course: https://github.com/buttplugio/buttplug. Written in Rust, of course.

by the_af, 4 hours ago

> Written in Rust, of course.

Safety is important.

by vntok, 5 hours ago

Was your teacher Ted Nelson?

by mikkupikku, 4 hours ago

I wish, dude is a legend.

by manmal, 5 hours ago

ding-dong-cli is needed

by Herring, 5 hours ago

what.. :o

by embedding-shape, 6 hours ago

Someone correct me if I'm wrong, but seemingly a lot of the people who found a "love interest" in LLMs seem to have preferred 4o for some reason. There were a lot of loud voices about that in the subreddit r/MyBoyfriendIsAI when it initially went away.

by drittich, 5 hours ago

I think it's time for an https://hotornot.com for AI models.

by vntok, 5 hours ago

botornot?

by MattGaiser, 6 hours ago

The writing with the 5 models feels a lot less human. It is a vibe, but a common one.

by karmasimida, 6 hours ago

It is a bigger model, confirmed

by MarcFrame, 5 hours ago

how does 5.4-thinking have a lower FrontierMath score than 5.4-pro?

by nico1207, 5 hours ago

Well 5.4-pro is the more expensive and more advanced version of 5.4-thinking so why wouldn't it?

by nimchimpsky, 5 hours ago

[dead]

by dom96, 5 hours ago

Why do none of the benchmarks test for hallucinations?

by tedsanders, 4 hours ago

In the text, we did share one hallucination benchmark: Claim-level errors fell by 33% and responses with an error fell by 18%, on a set of error-prone ChatGPT prompts we collected (though of course the rate will vary a lot across different types of prompts).

Hallucinations are the #1 problem with language models and we are working hard to keep bringing the rate down.

(I work at OpenAI.)

by netule, 5 hours ago

[flagged]