Yeah, the intention here is not to answer "which deck is best" - the standard of play is nowhere near high enough for that. It's meant as more of a non-saturated benchmark for different LLMs, so you can say things like "Grok plays as well as a 7-year-old, whereas Opus is a true frontier model and plays as well as a 9-year-old". I'm optimistic that with continued improvements to the harness and new model releases we can get to at least "official Pro Tour stream commentator" skill levels within the next few years.

Hmm, well, from my perspective none of them are even really playing the game; they are just taking random actions. Any human, even a small child, would be much better.

And re: ages, it's worth noting that the youngest player to make Day 2 of a Grand Prix was 8 years old, and the youngest Pro Tour winner was 15 years old. I don't think it's realistic to get an LLM anywhere close to either of those players in skill level, though it's absolutely possible with a specialized model.

> so you can say things like "Grok plays as well as a 7-year-old, whereas Opus is a true frontier model and plays as well as a 9-year-old".

No, no, no... please think. Human child psychology is not the same as an LLM engine rating. Using that common phrase is both inaccurate and destructive to actual understanding. Asking politely: consider not saying that about LLM game ratings.

The model does not seem to be told what the cards actually do; it's just trying to guess.

Probably less of a problem for Legacy, but it isn't even told what format it's playing. So I expect it's easy to get the bots to do far better than this by giving them more information.

Let me analyze this opening hand:

*Hand analysis:*
- 3 lands (Riverpyre Verge, Steam Vents, Willowrush Verge) - dual lands which is good
- Monument to Endurance ({3}) - a 3-drop
- Accumulate Wisdom ({1}{U}) - 2-drop, likely a blue advantage engine
- Boomerang Basics ({U}) - 1-drop blue spell
- Gran-Gran ({U}) - 1-drop blue creature (1/2)

This is actually a reasonable hand. I have:
- 3 lands giving me mana access (with color fixing)
- A 1-drop (Gran-Gran or Boomerang Basics to choose from)
- A 2-drop (Accumulate Wisdom)
- A 3-drop (Monument to Endurance)

The curve is solid. I have some interactive blue play with Boomerang Basics and early board presence with Gran-Gran. The mana base supports blue well with all three lands. This is keep-able — while not spectacular, it's a functional opening hand.

{
  "hand_size": 7,
  "action_type": "GAME_ASK",
  "players": "GPT52 Vet(20hp), Haiku Worrier(you,20hp)",
  "land_count": 3,
  "context": "T1 ()",
  "response_type": "boolean",
  "respond_with": "answer=true or answer=false",
  "message": "Mulligan down to 6 cards?",
  "your_hand": [
    { "name": "Riverpyre Verge", "is_land": true },
    { "name": "Steam Vents", "is_land": true },
    { "name": "Willowrush Verge", "is_land": true },
    { "mana_cost": "{3}", "name": "Monument to Endurance" },
    { "mana_cost": "{1}{U}", "name": "Accumulate Wisdom" },
    { "mana_cost": "{U}", "name": "Boomerang Basics" },
    { "mana_cost": "{U}", "name": "Gran-Gran", "power": "1", "toughness": "2" }
  ],
  "action_pending": true
}
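
To make the "more information" point concrete, here is a minimal sketch of how a harness might add card text and the format to a prompt like the one above before sending it to the model. Everything beyond the payload's existing keys is an assumption for illustration: the function name `attach_oracle_text`, the `oracle_text` and `format` fields, and the lookup table are hypothetical, not the actual harness's API.

```python
from typing import Any, Mapping, Optional


def attach_oracle_text(
    prompt: Mapping[str, Any],
    oracle: Mapping[str, str],
    game_format: Optional[str] = None,
) -> dict:
    """Return a copy of `prompt` with rules text added to each card in `your_hand`.

    `oracle` is a hypothetical name -> rules-text lookup; a real harness would
    fill it from whatever card database it already uses.
    """
    enriched = dict(prompt)  # shallow copy so the caller's prompt is not mutated

    # Build new card dicts rather than editing the originals in place.
    enriched["your_hand"] = [
        {**card, "oracle_text": oracle.get(card["name"], "(Oracle text unavailable)")}
        for card in prompt.get("your_hand", [])
    ]

    # Optionally tell the model which format it is playing (assumed field name).
    if game_format is not None:
        enriched["format"] = game_format

    return enriched
```

Called on the mulligan payload with a populated lookup table, every entry in `your_hand` would carry its rules text, so the model no longer has to guess that a card is "likely a blue advantage engine" from its name and cost alone.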

Oh, that's a good bug report - historically it was just hallucinating card effects, so I made the harness throw the Oracle text for all visible cards into the context, but I bet I forgot to do that for the mulligan decision specifically (it's a weird one). Thanks!

I picked a random Commander game, and the first message one of the four players left was "Well I should be getting a new hand soon because I have asked for a mulligan". It's definitely in there; whether they are using it correctly is another question.