GLM 5.2 Performance Benchmarks

(artificialanalysis.ai)

125 points

by theanonymousone10 hours ago |

by wongarsu6 hours ago|

It does really well on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable. I really like that benchmark because it's one of the few benchmarks that allows LLMs to elect not to answer if they are unsure and punishes them for trying to bullshit their way through the benchmark

by corlinp54 minutes ago|

That one is a bit sus to me, because the models that do the worst on Omniscience Accuracy do the best on non-hallucination. The top model for this benchmark is "MiniCPM5-1B (Non-reasoning)" which gets a whopping 99% vs 45% for Fable 5.

I'd love to see a good hallucination benchmark, but this isn't one. There's no possibility that a 1B model hallucinates less than Fable 5.

by SilverServer4 hours ago|

It took me a while to figure out how to interpret the benchmark correctly, because the overview page says "AA-Omniscience Non-Hallucination Rate," while the benchmark page (https://artificialanalysis.ai/evaluations/omniscience#aa-omn...) said "the lower, the better." Eventually, I realized that the "non" reverses the scores. And indeed, the results are consistent.

by andai4 hours ago|

This implies that other benchmarks (for which every AI provider is optimizing?) are actively encouraging bullshitting?

by mattalex2 hours ago|

The issue with having a "no answer" option is that you implicitly add a decision problem into your test that depends on the "cost" of answering wrong.

Specifically, your model now has two "correct" classes p(class=y|x) and p(class=⊥|x). This makes the results ambiguous. The way you resolve this is by adding in a cost of answering wrong (l_err) and a cost of not answering (l_⊥).

L(y, y') =
    0      if y = y'
    l_err  if y ≠ y' and y' ≠ ⊥
    l_⊥    if y' = ⊥

You can then estimate the expected error over your dataset. Notice that this now gives you additional degrees of freedom: Depending on how expensive answering wrong is compared to not answering at all, your predictor might be really bad or really good.

This means when benchmarking with a "no answer" action, you are often not actually benchmarking whether the model works well or not, but rather are benchmarking how well the model _happens_ to agree with the class-error weight you (implicitly) chose in your model.
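As a rough illustration of that point, here is a minimal Python sketch of the loss above. The two "models" and the cost values are made up for the example (they are not taken from any benchmark), and `None` stands in for ⊥. It only shows how the ranking of two predictors can flip when the relative cost of a wrong answer changes.

```python
ABSTAIN = None  # stands in for the "no answer" class ⊥


def expected_loss(predictions, labels, l_err, l_abstain):
    """Average of L(y, y') over a dataset: 0 if correct, l_err if wrong, l_abstain if abstaining."""
    total = 0.0
    for y_pred, y_true in zip(predictions, labels):
        if y_pred is ABSTAIN:
            total += l_abstain
        elif y_pred != y_true:
            total += l_err
        # a correct answer adds 0
    return total / len(labels)


labels = ["a"] * 10
# Hypothetical model A answers everything: 7 right, 3 wrong.
model_a = ["a"] * 7 + ["b"] * 3
# Hypothetical model B answers half: 5 right, 5 abstentions, never wrong.
model_b = ["a"] * 5 + [ABSTAIN] * 5

# Two arbitrary cost settings: wrong answers as cheap as abstaining, then 4x as expensive.
for l_err, l_abstain in [(1.0, 1.0), (4.0, 1.0)]:
    loss_a = expected_loss(model_a, labels, l_err, l_abstain)
    loss_b = expected_loss(model_b, labels, l_err, l_abstain)
    better = "A" if loss_a < loss_b else "B"
    print(f"l_err={l_err}, l_abstain={l_abstain}: A={loss_a:.2f}, B={loss_b:.2f}, better={better}")
```

With equal costs the model that always answers comes out ahead; once a wrong answer costs four times as much as abstaining, the cautious model wins, even though neither model changed.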

by WarmWash3 hours ago|

There is a tradeoff where as factual accuracy increases, creativity decreases, and the model becomes more "rigid" and less general. Unfortunately it seems that creativity is a good quality for reasoning and ultimately problem solving.

So we have a situation where models that can solve challenging problems also tend to have problems with hallucinating, but those hallucinations seem to be the breeding ground for the solutions that earned them their high "wow"-factor intelligence.

by wongarsu3 hours ago|

Yes. Most benchmarks just measure how many answers are correct. The best way to optimize that is to confidently state something, in hopes it's correct. Which is exactly how most LLMs behave, despite plenty of evidence that they do know whether they "know" something

by Imustaskforhelp3 hours ago|

If this is the case, then GLM 5.2 seems better than GPT 5.5, or maybe even "Fable", depending on what you are trying to achieve.

The Fable model was removed from Anthropic's offerings because of security concerns raised by the US government (or, well, also partly because of the personal vendetta between the US government and Anthropic).

by whimblepop4 hours ago|

Bullshitting is how LLMs work. It doesn't require active encouragement. All it takes is a machine without consciousness or physical access to the world and an actually-lived life. A training set that contains lots of confident answers and few to no refusals doesn't help either.

by otabdeveloper43 hours ago|

It's simpler than that.

An LLM outputs tokens, one-by-one. It stops the loop if it outputs the end-of-text token. Which is, of course, statistically much rarer than any other kind of token.

(This is why you cannot, in general, prompt an LLM with something like "don't answer if the result is correct". It has to output something, by design.)
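A toy Python sketch of the loop described here, for illustration only: `toy_next_token` is a hypothetical stand-in that samples from a five-word vocabulary with made-up weights, not a real model or any library's API. The point is just the control flow: one token is sampled per step, and generation ends only when the end-of-text token happens to be sampled (or a length cap is hit).

```python
import random

EOS = "<eos>"  # end-of-text marker used by this toy example


def toy_next_token(context, rng):
    """Hypothetical stand-in for a model's sampling step (ignores the context)."""
    vocab = ["the", "answer", "is", "42", EOS]
    weights = [4, 3, 3, 2, 1]  # made-up weights; EOS is the rarest token
    return rng.choices(vocab, weights=weights, k=1)[0]


def generate(prompt_tokens, max_tokens=50, seed=0):
    """Sample tokens one by one until EOS is drawn or max_tokens is reached."""
    rng = random.Random(seed)
    output = []
    for _ in range(max_tokens):
        token = toy_next_token(prompt_tokens + output, rng)
        if token == EOS:
            break  # the only way the loop stops early
        output.append(token)
    return output


print(generate(["what", "is", "the", "answer", "?"]))
```

In this sketch, "not answering" is only possible if the very first sampled token is the end-of-text token; otherwise declining has to be expressed as ordinary output text.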

by trouve_search2 hours ago|

A lot of benchmarks are set up to punish false negatives (missing the snippet being looked for) but not false positives (irrelevant answers or extra text).

This leads to answer bloat and/or hallucination if you benchmaxx on those.

by Zababa3 hours ago|

They are, especially multiple-choice questions. The same happens with human exams:

Let's say there are 100 questions, with 4 answers each. A good answer is worth 1 point. By just guessing you get an average of 25/100, way more than 0/100 by not replying.

If instead a wrong answer is -1 point, by just guessing you get on average -50/100 (25 correct minus 75 wrong), way worse than 0/100.
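The arithmetic can be checked with a short Python sketch. The function name and parameters are just illustrative. With 100 four-choice questions, uniform guessing is expected to get 25 right and 75 wrong, so a -1 penalty gives 25 - 75 = -50, and a penalty of 1/3 point per wrong answer is what makes guessing exactly as good as not replying.

```python
from fractions import Fraction


def expected_guess_score(n_questions, n_choices, reward=1, penalty=0):
    """Expected total score when every answer is a uniform random guess."""
    p_correct = Fraction(1, n_choices)
    per_question = p_correct * reward - (1 - p_correct) * penalty
    return n_questions * per_question


print(expected_guess_score(100, 4))                          # 25
print(expected_guess_score(100, 4, penalty=1))               # -50
print(expected_guess_score(100, 4, penalty=Fraction(1, 3)))  # 0
```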

by fcpk23 minutes ago|

Tangent question: Claude Code seems to be well liked and recommended by most major Chinese LLM providers, using the env vars to point it at a different server. That, however, means you lose a lot of Anthropic tools like auto mode, running shells, and monitors/crons. Is there a way to get those with non-Anthropic plans?

by gertlabs1 hours ago|

On our multi-agent coding and reasoning evaluations, GLM 5.2 is the first model we've tested that crossed the threshold of being on par with or better than Opus 4.6 (although, as usual, we rank GLM 5.2 and most other Chinese models a bit lower than most other benchmarks do, since their test methodologies are more vulnerable to benchmaxxing).

Data at https://gertlabs.com/rankings

by lanycrost6 hours ago|

It's always nice to see open source models growing. I hope we will get good performance on lower-tier hardware some day.

by theturtletalks5 hours ago|

I want to trust their benchmarks but when they have Muse Spark over GPT-5.5, it gives me pause.

by mdasen3 hours ago|

Where do you see that? I see they have GPT-5.5 (xhigh) at 55, GPT-5.5 (high) at 53, and Muse Spark at 43. Muse Spark does beat GPT-5.4 mini (xhigh) which scores 40, but the key there is "mini".

In the coding index, GPT-5.5 gets 59.1, 58.5, 56.2, and 52.1 for xhigh, high, medium, and low while Muse Spark is behind at 47.5. For agentic, GPT-5.5 gets 74.1, 72.0, 69.4, and 59.7 (xhigh, high, medium, low) while Muse Spark gets 62.0 (beating only GPT-5.5 low).

GPT-5.5 only gets beaten by Opus 4.8 in their general index, is the top spot for coding, and is #3 behind Opus 4.8 and GLM-5.2 for agentic (excluding Fable 5 which takes the top spot, but is unavailable).

by XCSme5 hours ago|

I also tested it[0]: quite similar to GLM 5, a few percent better, 30% faster and 50% more expensive.

[0]: https://aibenchy.com/?q=glm

by benxh4 hours ago|

benchmark where gemini flash is better than fable btw.

by XCSme3 hours ago|

Well, most people were not liking Fable when it was available anyway, because it refused to answer questions very often.

by margalabargala3 hours ago|

And therefore it scores worse on benchmarks?

by XCSme2 hours ago|

Also, Claude/Fable models are quite bad at instruction following: https://artificialanalysis.ai/evaluations/ifbench

by XCSme2 hours ago|

On some it does, yes, and also in real usage.

It avoided answering 2/21 tests in this specific benchmark, which already caps its max score at about 90%.

by margalabargala2 hours ago|

I'm glad those tests apparently work out for you but a benchmark where three of the top 5 models are different flavors of Gemini Flash and zero are anything by Anthropic, is just so wildly divergent from my personal experience with the models that it's not useful to me.

Whatever it is you're measuring, it's not anything related to what I use models for.

by XCSme2 hours ago|

Thanks for the feedback!

What are you using Claude models for? Coding only? Computer use? Which harness?

by margalabargala2 hours ago|

Not only coding but also general knowledge work, anything from learning about how some things work (e.g. walking me through PNP vs NPN transistors) to summarizing texts, doing web research, and occasionally some OCR.

I've experimented with a few models for all this and have found Gemini the best at OCR but quite a bit worse at the rest. Claude is worse than GPT at web research-shaped things, but Opus 4.8 wins my anecdote benchmark for the other tasks besides those two.

But really, for code or knowledge stuff Gemini is markedly worse than the others, while Opus and GPT 5.5 are very, very close.

by XCSme5 hours ago|

PS: Just added a cool feature, so you can filter the leaderboard for multiple models at once, by using a comma, like: https://aibenchy.com/?q=glm,claude

by lousken5 hours ago|

still 1/4 of the price of anthropic and openai models though

by hemkeshr4 hours ago|

Local models are already useful today. The next milestone is getting this level of performance onto truly affordable hardware.

by SV_BubbleTime3 hours ago|

NVidia has less than zero reason to ship cards ideal for this at low prices.

AMD’s stock price reflects a hope they launch a CUDA alternative. But this is unlikely for the near future.

There is a lot of interest in preventing China coming in with cheap AI hardware.

So I expect the direction to be good local models that few can run effectively.

by theplumber3 hours ago|

The Chinese will flood the market with cheap AI chips just like they did with EV cars. As consumers we can’t thank them enough.

by omnimus2 hours ago|

It's already moving that way with Huawei AI chips.

by binary1323 hours ago|

I think it will eventually result in regulation and a potential grey market, and/or implosion of the centralized LLM services — I doubt they can keep hardware from becoming cheaper forever, and diminishing returns will make consumer hardware suitable for all but the hardest problems. At that point, the hardware “moat” will be completely gone and have become an extreme unrecoverable sunk cost.

by theplumber1 hours ago|

Well, you have tariffs and bans on EVs as well, so surely there will be bans and tariffs on AI products and chips too. But for people who really want the chips and models, we know they can get them. I expect a market like the one there used to be for pirated content.

by kajman34 minutes ago|

I'm cautiously optimistic that anti-competitive action against hardware will fail. There's a lot of money willing to fight for cheaper inference. The same can't be said for providing consumers with cheaper cars.

I can't say I'm as optimistic about there continuing to be an open market for foreign LLMs.

by DeathArrow6 hours ago|

One or two more releases and they will reach Fable level.

by vitalyan1235 hours ago|

by then there will be Fable 5.21, again 5% ahead of every other SotA while still only 500% the size.

by mjhay3 hours ago|

There’s no way Anthropic can keep jacking up the prices like this for every marginally better model. I think even tokenmaxxing companies are going to soon balk at $50/million output tokens.

by theplumber3 hours ago|

Anthropic wants to ban the alternatives through regulation and ideally provide differential access with differential pricing.

by sourcecodeplz6 hours ago|

still quite verbose at 140m output tokens, but this is on max thinking. high should do better.

by ChrisArchitect5 hours ago|

Some more discussion: https://news.ycombinator.com/item?id=48567759
