Are the scores here normalized so that a one-point difference means the same thing anywhere on the scale?
rank score age size name
1 62.0 8 - Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
2 59.1 55 - GPT-5.5 (xhigh)
3 58.5 55 - GPT-5.5 (high)
4 57.2 104 - GPT-5.4 (xhigh)
5 56.7 20 - Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
6 55.5 118 - Gemini 3.1 Pro Preview
7 53.1 62 - Claude Opus 4.7 (Non-reasoning, High Effort)
8 53.1 132 - GPT-5.3 Codex (xhigh)
9 52.5 62 - Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
10 51.5 92 - GPT-5.4 mini (xhigh)
11 50.9 120 - Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
12 50.7 1 large GLM-5.2 (max)
13 50.1 29 - Qwen3.7 Max
14 48.7 188 - GPT-5.2 (xhigh)
15 48.1 132 - Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
16 47.8 205 - Claude Opus 4.5 (Reasoning)
17 47.6 132 - Claude Opus 4.6 (Non-reasoning, High Effort)
18 47.5 70 - Muse Spark
19 47.5 54 large DeepSeek V4 Pro (Reasoning, Max Effort)
20 47.1 58 large Kimi K2.6
21 47.1 29 - Gemini 3.5 Flash (minimal)
22 46.7 449 - Gemini 2.5 Pro Preview (Mar' 25)
23 46.5 211 - Gemini 3 Pro Preview (high)
24 46.5 16 - Qwen3.7 Plus
25 46.4 120 - Claude Sonnet 4.6 (Non-reasoning, High Effort)
26 45.6 5 large Kimi K2.7 Code
27 45.6 104 - GPT-5.4 (low)
28 45.5 56 large MiMo-V2.5-Pro
29 45.1 43 - GPT-5.5 Instant (May 2026)
30 45.0 29 - Gemini 3.5 Flash (high)
31 44.9 58 - Qwen3.6 Max Preview
32 44.7 216 - GPT-5.1 (high)
33 44.2 188 - GPT-5.2 (medium)
34 44.2 126 large GLM-5 (Reasoning)
35 43.9 92 - GPT-5.4 nano (xhigh)
36 43.4 71 large GLM-5.1 (Reasoning)
37 43.4 16 large MiniMax-M3
38 43.2 54 large DeepSeek V4 Pro (Reasoning, High Effort)
39 43.0 188 - GPT-5.2 Codex (xhigh)
40 42.9 76 - Qwen3.6 Plus
41 42.9 205 - Claude Opus 4.5 (Non-reasoning)
42 42.6 182 - Gemini 3 Flash Preview (Reasoning)
43 42.2 99 - Grok 4.20 0309 (Reasoning)
44 42.1 56 large MiMo-V2.5
45 41.9 91 large MiniMax-M2.7
46 41.4 91 - MiMo-V2-Pro
47 41.3 121 large Qwen3.5 397B A17B (Reasoning)
48 41.0 48 - Grok 4.3 (high)
49 40.5 71 - Grok 4.20 0309 v2 (Reasoning)
50 40.5 342 - Grok 4
51 39.8 54 large DeepSeek V4 Flash (Reasoning, High Effort)
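For anyone who wants to reproduce this kind of trimming: the list keeps at most two entries per model and stops at a cutoff entry. Below is a minimal sketch of that filter, not the script actually used. The function names `base_model` and `curate`, the rule of grouping by the name with its trailing parenthesised variant stripped, and the sample rows are all assumptions for illustration.

```python
import re
from collections import defaultdict


def base_model(name: str) -> str:
    """Strip a trailing parenthesised variant: 'GPT-5.5 (xhigh)' -> 'GPT-5.5'.

    Assumption: variants of one model differ only in that trailing part.
    """
    return re.sub(r"\s*\([^)]*\)\s*$", "", name)


def curate(rows, cutoff_score, per_model=2):
    """Keep the `per_model` highest-scoring entries per base model.

    rows: iterable of (score, name) tuples.
    Entries scoring below `cutoff_score` are dropped; the cutoff itself is kept.
    """
    kept = []
    seen = defaultdict(int)  # base model -> number of entries kept so far
    for score, name in sorted(rows, key=lambda r: r[0], reverse=True):
        if score < cutoff_score:
            break  # rows are sorted, so everything after this is lower
        key = base_model(name)
        if seen[key] < per_model:
            seen[key] += 1
            kept.append((score, name))
    return kept


# Hypothetical rows, not taken from the leaderboard.
rows = [
    (59.0, "Model A (xhigh)"),
    (58.0, "Model A (high)"),
    (50.0, "Model A (low)"),        # third entry for Model A, dropped
    (45.0, "Model B"),
    (39.8, "Model C (Reasoning)"),  # the cutoff entry, kept
    (30.0, "Model D"),              # below the cutoff, dropped
]

for score, name in curate(rows, cutoff_score=39.8):
    print(score, name)
```

Passing the cutoff as a score rather than a model name keeps the sketch simple; looking up the score of the chosen cutoff entry first would give the same result.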
A longer curated list based on kristopolous’ list, with more models included. For each model, I kept only the two highest-scoring entries. I used DeepSeek V4 Flash as the cutoff, since I consider it the lowest acceptable model that is still locally deployable.
Surprised to see MiniMax M3 so low on that list. That's not really my experience: I found it smarter than Gemini for a lot of things, that's for sure.
Also surprised to see Gemini 3.1 ranked that high there. It remains IMHO blatantly incompetent at tool use, even in Google's own harnesses, so I can only assume this benchmark doesn't weight agentic workflows very heavily. Gemini can write code just fine. It just can't work well as an agent.
In my experience, GLM 5.2 and Qwen3.7 Max were fairly expensive per token, and hard to argue for when the SOTA coding plans have a fixed price that makes them potentially more cost-effective. (Yes, I know z.ai has a coding plan, but I've heard reliability nightmare stories, and it's not very cheap.)
DeepSeek is clearly the best value for $$, given the right harness and prompting.
- GPT 5.5 is consistently the best, an opinion that gets me constant downvotes here from the Anthropic Marketeer strike force...
- China is going to eat the US's lunch on AI
- What have European universities and companies been doing? It's as if, in a parallel past/future, Nikola Tesla and Edison had created flying cyberpunk machines while European researchers were getting together to request EU funds to investigate how to breed faster horses.
- If Zuckerberg could be fired after spending a total of $235 billion on AI and having NOTHING to show for it... should he be?
Mistral is clearly not competing at the frontier right now. Whether this is due to a lack of VC funding, a lack of technical ability, or the former arising from the latter would be interesting to know.
The top models are from startups. Among the FAANG companies, only Google managed to get a frontier model, and they literally invented the architecture and have more money than they can possibly spend to throw at the problem. Facebook shows that even ungodly amounts of money don't get you there, though.
So why did no EU-based startups succeed while two US startups did? I agree that's a very important question the EU should ask. The Internet revolution was driven by US companies, and now AI will be as well, with Chinese open-weights models mixed in. The EU consistently cannot turn its considerable economic output into fast-moving tech firms.
They've got a heap of contractors working to help industry adopt LLMs. It is just classic consulting work, and they'd look like a really great company if we weren't comparing them to literal $2T+ companies losing money hand-over-fist...
They had Watson, remember? It won on Jeopardy like 15 years ago. They've been at this for a long time.
Maybe it's good at something else?
Upon closer inspection the $1B is (a) over 10 years, (b) mostly internal cross-billing between departments.
They had to start from scratch, but they don't seem to have management smart enough to stop doing it in-house. They could have just acquired a startup that could build a frontier model.
Which is also very ironic, since their whole business for the last 15 years has been buying companies a la CA Associates...
Their previous Watson branding and the collapse of Watson expectations cost them one CEO, but the current CEO was part of the same team. They just don't learn...
ETH Zurich and EPFL recently put out an open model called Apertus (it was on the HN front page a few months back). It's not a frontier model, but they built it properly with regard to copyright and data transparency.
It might look a bit slow or old-fashioned, but focusing on doing things ethically and legally feels like a much better path than just joining the race to scrape everything.
Doing things with ethical intentions does not necessarily produce outcomes that are beneficial for society at large.
Also, no, abandoning ethics is not an option, what a ridiculous suggestion.
Yes, if the premise were true, but it’s not.
Also, what are they building it for? I'd think it's to serve ads better or something like that. Maybe Muse Spark fits Facebook's needs perfectly...
I mean that is the smart move here. Focus the model on optimizing the core business. For Meta, that's not coding tools.
They will forever have superior weights?
There is a lot of money in pretending that we are seeing unending revolutions.
1. SamA and his company have a well-deserved bad reputation, and Anthropic got some early good PR for basically not being SamA.
2. Claude Code got early mindshare. Boris and crew basically "invented" this kind of agent, so it has first-mover advantage despite its known reliability and cost issues.
3. Most people I talk to haven't even tried Codex for some reason.
Also it's uncool to complain about downvotes.
And Zuck hasn't spent that much on AI yet. Half of that is projected spending for 2026.
As to whether it's all for nothing, Q1 2026 revenue was up 33% over Q1 last year, driven largely by...better AI-driven ad targeting. So the spending doesn't seem that crazy to me.
Might also just be the result of "good will" (that the company has deftly fostered). Other companies might learn from Anthropic in that regard.
That Google is dropping the ball so badly on the coding side of things, or is just uninterested in it, is a sign of either incompetence or a lack of interest in losing money in that space. I wish I knew which.
If you really want to see all of them:
Or run the script