https://artificialanalysis.ai indicates that Sonnet 4.6 beats Opus 4.6 on GDPval-AA, Terminal-Bench Hard, AA Long Context Reasoning, and IFBench.
see: https://artificialanalysis.ai/?models=claude-sonnet-4-6%2Ccl...
Gemini and Claude also have their strengths. Apparently Claude handles real-world software better, but with the extended context and improvements to Codex, ChatGPT might end up taking the lead there as well.
I don't think linear scoring really fits how some of these benchmarks are being used, either: a 1% increase on a given benchmark could mean a 50% capabilities jump relative to a human skill level. If this rate of progress holds steady, though, this year is gonna be crazy.
It’s a required step for me at this point to run any and all backend changes through Gemini 3.1 Pro.
Yet so much slower than Gemini / Nano Banana that it's almost unusable for anything iterative.
Do you want to make any concrete predictions of what we'll see at this pace? It feels like we're reaching the end of the S-curve, at least to me.
I am betting that the days of these AI companies losing money on inference are numbered, and we're going to be much more dependent on local capabilities sooner rather than later. I predict that the equivalent of Claude Max 20x will cost $2000/mo in March of 2027.
My biggest worry is that the private-jet class ends up with absurdly powerful AI at their fingertips, while the rest of us are left with our BigMac McAIs.
You can ask 4o to tell you "I love you" and it will comply. Some people really really want/need that. Later models don't go along with those requests and ask you to focus on human connections.
The 5.x series have terrible writing styles, which is one way to cut down on sycophancy.
We’ve seen nothing yet.
Safety is important.
Hallucinations are the #1 problem with language models and we are working hard to keep bringing the rate down.
(I work at OpenAI.)