undefined

points

[-]

The LLMs do play rated engines (maia and eubos). They provide the baselines. Gemini e.g. consistently beats the different maia versions.

The rest is taken care of by elo. That is they then play each other as well, but it is not really possible for Gemini to have a higher elo than maia with such a small sample size (and such weak other LLMs).

Elo doesn't let you inflate your score by playing low ranked opponents if there are known baselines (rated engines) because the rated engines will promptly crush your elo.

You could add humans into the mix, the benchmark just gets expensive.

by runarberg6 hours ago|

parent|

[-]

I did indeed miss something. I learned after posting (but before my EDIT) that there are anchor engines that they play.

However these benchmarks still have flaws. The two illegal moves = forfeit is an odd rule which the authors of the benchmarks (which in this case was Claude Code) added[1] for mysterious reasons. In competitive play if you play an illegal move you forfeit the game.

Second (and this is a minor one) Maia 1900 is currently rated at 1774 on lichess[2], but is 1816 on the leaderboard, to the author’s credit they do admit this in their methodology section.

Third, and this is a curiosity, gemini-3-pro-preview seems to have played the same game twice against Maia 1900[3][4] and in both cases Maia 1900 blundered (quite suspiciously might I add) mate in one when in a winning position with Qa3?? Another curiosity about this game. Gemini consistently played the top 2 moves on lichess. Until 16. ...O-O! (which has never been played on lichess) Gemini had played 14 most popular lichess moves, and 2 second most popular. That said I’m not gonna rule out that the fact that this game is listed twice might stem from an innocent data entry error.

And finally, apart from Gemini (and Survival bot for some reason?), LLMs seem unable to pass Maia-1100 (rated 1635 on lichess). The only anchor bot before that is random bot. And predictably LLMs cluster on both sides of it, meaning they play as well as random (apart from the illegal moves). This smells like benchmaxxing from Gemini. I would guess that the entire lichess repertoire features prominently in Gemini’s training data, and the model has memorized it really well. And is able to play extremely well if it only has to play 5-6 novel moves (especially when their opponent blunders checkmate in 1).

1: https://github.com/lightnesscaster/Chess-LLM-Benchmark/commi...

2: https://lichess.org/@/maia9

3: https://chessbenchllm.onrender.com/game/6574c5d6-c85a-4cb3-b...

4: https://chessbenchllm.onrender.com/game/4af82d60-8ef4-47d8-8...

by dwohnitmok5 hours ago|

parent|

[-]

> The two illegal moves = forfeit is an odd rule which the authors of the benchmarks (which in this case was Claude Code) added[1] for mysterious reasons. In competitive play if you play an illegal move you forfeit the game.

This is not true. This is clearly spelled out in FIDE rules and is upheld at tournaments. First illegal move is a warning and reset. Second illegal move is forfeit. See here https://rcc.fide.com/article7/

I doubt GDM is benchmarkmaxxing on chess. Gemini is a weird model that acts very differently from other LLMs so it doesn't surprise me that it has a different capability profile.

by runarberg5 hours ago|

parent|

[-]

>> 7.5.5 After the action taken under Article 7.5.1, 7.5.2, 7.5.3 or 7.5.4 for the first completed illegal move by a player, the arbiter shall give two minutes extra time to his/her opponent; for the second completed illegal move by the same player the arbiter shall declare the game lost by this player. However, the game is drawn if the position is such that the opponent cannot checkmate the player’s king by any possible series of legal moves.

I stand corrected.

I’ve never actually played competitive chess, I’ve just heard this from people who do. And I thought I remembered once in the Icelandic championships where a player touched one piece but moved the other, and subsequently made to forfeit the game.

by runarberg4 hours ago|

parent|

prev|

[-]

Replying in a split thread to clearly separate where I was wrong.

If Gemini is so good at chess because of a non-LLM feature of the model, then it is kind of disingenuous to rate it as an LLM and claim that LLMs are approaching 2000 ELO. But the fact it still plays illegal moves sometimes, is biased towards popular moves, etc. makes me think that chess is still handled by an LLM, and makes me suspect benchmaxxing.

But even if no foul play, and Gemini is truly a capable chess player with nothing but an LLM underneath it, then all we can conclude is that Gemini can play chess well, and we cannot generalize to other LLMs who play about the level of random bot. My fourth point above was my strongest point. There are only 4 anchor engines, one beats all LLMs, second beats all except Gemini, the third beats all LLMs except Gemini and Survival bot (what is Survival bot even doing there?) and the forth is random bot.

by emp173441 days ago|

prev|

[-]

That’s a devastating benchmark design flaw. Sick of these bullshit benchmarks designed solely to hype AI. AI boosters turn around and use them as ammo, despite not understanding them.

by famouswaffles1 days ago|

parent|

[-]

Relax. Anyone who's genuinely interested in the question will see with a few searches that LLMs can play chess fine, although the post-trained models mostly seem to be regressed. Problem is people are more interested in validating their own assumptions than anything else.

https://arxiv.org/abs/2403.15498

https://arxiv.org/abs/2501.17186

https://github.com/adamkarvonen/chess_gpt_eval

by dwohnitmok22 hours ago|

parent|

prev|

[-]

> That’s a devastating benchmark design flaw

I think parent simply missed until their later reply that the benchmark includes rated engines.

by runarberg1 days ago|

parent|

prev|

[-]

I like this game between grok-4.1-fast and maia-1100 (engine, not LLM).

https://chessbenchllm.onrender.com/game/37d0d260-d63b-4e41-9...

This exact game has been played 60 thousand times on lichess. The peace sacrifice Grok performed on move 6 has been played 5 million times on lichess. Every single move Grok made is also the top played move on lichess.

This reminds me of Stefan Zweig’s The Royal Game where the protagonist survived Nazi torture by memorizing every game in a chess book his torturers dropped (excellent book btw. and I am aware I just committed Godwin’s law here; also aware of the irony here). The protagonist became “good” at chess, simply by memorizing a lot of games.

by famouswaffles1 days ago|

parent|

[-]

The LLMs that can play chess, i.e not make an illegal move every game do not play it simply by memorized plays.