Public benchmarks can be trivially faked. Lmarena is a bit harder to fake and is human-evaluated.

I agree it's misleading for them to hyper-focus on one metric, but public benchmarks are far from the only thing that matters. I place more weight on Lmarena scores and private benchmarks.

reply
LMArena is so easy to game that it ceased to be a relevant metric over a year ago. Humans aren't reliable validators beyond "yeah, that looks good to me"; nobody checks whether the facts are actually correct.
reply
Alibaba maintains its own separate version of lm-arena where the prompts are fixed and you simply judge the outputs.

https://aiarena.alibaba-inc.com/corpora/arena/leaderboard

reply
I agree; LMArena died for me with the Llama 4 debacle. It wasn't just the gamed scores, it was seeing with shock and horror which answers people rated as good. It does test something, though: the general "vibe", and how human, friendly, and knowledgeable a model _seems_ to be.
reply
It's easy to game, and human evaluation data has its trade-offs, but public benchmark results are way easier to fake. I wish we had a source of high-quality private benchmark results covering as many models as LMArena does. High-quality human evaluation data on top would be a plus too.
reply
Well, there was this one [0], which is a black box but hasn't really been kept up to date with newer releases. Arguably we'd need lots of these, since each one could be biased towards some use case, or could sell its test set to someone with more VC money than sense.

[0] https://oobabooga.github.io/benchmark.html

reply
I know ARC-AGI-2 has a private test set and they have a good amount of results [0], but it's not a conventional benchmark.

Looking around, SWE Rebench seems to have decent protection against training data leaks[1]. Kagi has one that is fully private[2]. One on HuggingFace that claims to be fully private[3]. SimpleBench[4]. HLE has a private test set apparently[5]. LiveBench[6]. Scale has some private benchmarks but not a lot of models tested[7]. vals.ai[8]. FrontierMath[9]. Terminal Bench Pro[10]. AA-Omniscience[11].

So I guess we do have some decent private benchmarks out there.

[0] https://arcprize.org/leaderboard

[1] https://swe-rebench.com/about

[2] https://help.kagi.com/kagi/ai/llm-benchmark.html

[3] https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard

[4] https://simple-bench.com/

[5] https://agi.safe.ai/

[6] https://livebench.ai/

[7] https://labs.scale.com/leaderboard

[8] https://www.vals.ai/about

[9] https://epoch.ai/frontiermath/

[10] https://github.com/alibaba/terminal-bench-pro

[11] https://artificialanalysis.ai/articles/aa-omniscience-knowle...

reply
I can't shake the fact that the Chinese models all perform awfully on the private ARC-AGI-2 tests.
reply
But is ARC-AGI really that useful? Nowadays it seems to me it's just another benchmark that models need to be specifically trained for. Maybe the Chinese models just didn't focus on it as much.
reply
Doing great on public datasets and underperforming on private benchmarks is not a good look.
reply
Is it though? Do we still expect that LLMs will eventually be able to solve problems they haven't seen before? Or do we just want the most accurate autocomplete at the cheapest price at this point?
reply
I find the benchmarks to be suggestive but not necessarily representative of reality. It's really best if you have your own use case and can benchmark the models yourself. I've found the results to be surprising and not what these public benchmarks would have you believe.
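
If it helps, a private eval doesn't need to be fancy: a handful of prompts you never publish plus a checker gets you most of the way. A minimal sketch, assuming an OpenAI-compatible endpoint; the prompts, expected answers, and model IDs below are placeholders:

    # Minimal private eval sketch: your own prompts, your own checker.
    # Assumes an OpenAI-compatible API; test cases and model names are placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY (or point base_url at another provider)

    # Keep these private so they can't leak into anyone's training data.
    CASES = [
        {"prompt": "What is the capital of Australia?", "expect": "canberra"},
        {"prompt": "2 + 2 * 3 = ?", "expect": "8"},
    ]

    def score(model: str) -> float:
        hits = 0
        for case in CASES:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": case["prompt"]}],
            )
            answer = resp.choices[0].message.content.lower()
            hits += case["expect"] in answer  # crude substring check; swap in your own judge
        return hits / len(CASES)

    for model in ["gpt-4o-mini", "some-other-model"]:  # placeholder model IDs
        print(model, score(model))

Even a toy harness like this tends to reorder the leaderboard for your specific use case.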
reply
I can't find which Elo score specifically the benchmark chart is referring to; it's just labeled "Elo Score". It's not Codeforces Elo, since Gemma 4 31B has 2150 there, which would be off the chart shown.
reply
It's referring to the Lmsys Leaderboard/Lmarena/Arena.ai[0]. It's very well-known in the LLM community for being one of the few sources of human evaluation data.
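
Roughly: two anonymous models answer the same prompt, a human picks the better response, and a rating is fit to those pairwise votes. LMArena's published methodology has since moved to a Bradley-Terry style fit, but a plain online Elo update shows the mechanics (a minimal sketch; the votes and K-factor below are made up):

    # Sketch of arena-style Elo from pairwise human votes. Not LMArena's actual
    # pipeline; the vote data is invented for illustration.

    def expected(r_a, r_b):
        """Probability that A beats B under the Elo model."""
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def update(ratings, winner, loser, k=32):
        e = expected(ratings[winner], ratings[loser])
        ratings[winner] += k * (1 - e)  # winner gains in proportion to how unexpected the win was
        ratings[loser] -= k * (1 - e)   # loser drops symmetrically

    ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
    votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]

    for winner, loser in votes:
        update(ratings, winner, loser)

    print(sorted(ratings.items(), key=lambda kv: -kv[1]))

Which is also why it's gameable: anything that shifts raw human preference moves the score, whether or not the answers are correct.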

[0] https://arena.ai/leaderboard/chat

reply
It does not matter at all, especially when talking about Qwen, who have been caught making questionable benchmark claims multiple times.
reply