Per BalatroBench, gemini-3-pro-preview makes it to round (not ante) 19.3 ± 6.8 on the lowest difficulty with the deck aimed at new players. Round 24 is ante 8's final round. The BalatroBench setup also gives the LLM a strategy guide, which first-time players do not have. Gemini isn't even emitting legal moves 100% of the time.

Agreed. To me, Gemini 3 Pro has always felt like it has a pretraining alpha, if you will, and many data points continue to support that. Even Flash, which was post-trained with different techniques than Pro, matches Pro on tasks that depend heavily on post-training and occasionally even beats it (e.g., on Mercor's APEX bench, which, simplifying, is basically a tool-calling test, Flash beats Pro). The ARC-AGI-2 score is another data point in the same direction. Deep Think is a sort of parallel test-time compute with some level of distillation and refinement from certain trajectories (my guess, based on my usage and understanding), much like GPT-5.2 Pro, and it can extract more thanks to the pretraining data.

(I am sort of basing this on papers about the limits of RLVR and on the pass@k vs. pass@1 differences seen in RL post-training; this score just shows how "skilled" the base model was, or how strong its priors were. I apologize if this is not super clear; happy to expand on what I am thinking.)
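The pass@k vs. pass@1 distinction can be made concrete with the standard unbiased pass@k estimator from OpenAI's Codex paper (Chen et al., 2021): draw n samples for a task, count the c correct ones, and estimate the probability that at least one of k samples succeeds. This is a minimal sketch; the sample counts in the usage block are made-up numbers for illustration, not measurements of Gemini or any other model.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c are correct.

    Equals the probability that a random size-k subset of the n samples
    contains at least one correct sample: 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Too few incorrect samples to fill a size-k subset, so every subset hits.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical: 5 correct answers out of 100 samples on one task.
    for k in (1, 10, 50):
        print(f"pass@{k} = {pass_at_k(100, 5, k):.3f}")
```

With these hypothetical counts, a model that succeeds 5% of the time on a single try succeeds about 42% of the time within 10 tries. That kind of gap between pass@1 and pass@k is what the RLVR papers use to argue that post-training mostly sharpens capability already latent in the base model.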

It's trained on YouTube data. At the very least it will have seen Roffle and DrSpectred.

Google has a library of millions of scanned books from its Google Books project, which started in 2004. I think we have reason to believe that there are more than a few books about effectively playing various traditional card games in there, and that an LLM trained on that dataset could generalize to understand how to play Balatro from a text description.

Nonetheless I still think it's impressive that we have LLMs that can just do this now.

Winning in Balatro has very little to do with understanding how to play traditional poker. Yes, you do need a basic knowledge of the different types of poker hands, but the strategy for succeeding in the game is almost entirely unrelated to poker strategy.

If it tried to play Balatro using knowledge of, e.g., poker, it would lose badly rather than win. Have you played?

I think I weakly disagree. Poker players have an intuitive sense of the statistics of various hand types showing up, for instance, and that can be a useful clue as to which build types are promising.

>Poker players have an intuitive sense of the statistics of various hand types showing up, for instance, and that can be a useful clue as to which build types are promising.

Maybe in the early rounds, but deck fixing (e.g., Hanged Man, Immolate, Trading Card, DNA) quickly changes that, especially when pushing for "secret" hands like Five of a Kind, Flush Five, or Flush House.
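To put rough numbers on the hand-statistics point, here is a small sketch that computes the exact chance that a freshly dealt hand already contains five cards of one suit, using a multivariate hypergeometric enumeration over how many cards of each suit land in the hand. It assumes an 8-card hand dealt from the full deck and ignores discards, redraws, and any jokers or enhancements that change hand rules; the "fixed" deck composition is hypothetical, chosen only to show how thinning a deck toward one suit shifts the odds.

```python
from itertools import product
from math import comb, prod


def p_flush_in_hand(suit_counts, hand_size=8, flush_size=5):
    """Chance that `hand_size` cards drawn without replacement from a deck with
    the given per-suit counts include at least `flush_size` cards of one suit.

    Exact enumeration over every possible per-suit count in the hand.
    The default hand_size=8 is an assumption about the starting hand size.
    """
    deck_size = sum(suit_counts)
    if deck_size < hand_size:
        raise ValueError("deck is smaller than the hand")
    total = comb(deck_size, hand_size)
    hits = 0
    for draw in product(*(range(min(n, hand_size) + 1) for n in suit_counts)):
        if sum(draw) == hand_size and max(draw) >= flush_size:
            # Number of hands with exactly this many cards from each suit.
            hits += prod(comb(n, k) for n, k in zip(suit_counts, draw))
    return hits / total


if __name__ == "__main__":
    standard = [13, 13, 13, 13]  # fresh 52-card deck
    fixed = [20, 7, 7, 6]        # hypothetical 40-card deck thinned toward one suit
    print(f"standard deck: {p_flush_in_hand(standard):.3f}")
    print(f"fixed deck:    {p_flush_in_hand(fixed):.3f}")
```

On a standard deck this comes out to roughly 7%, and it rises sharply for the skewed deck, which is the sense in which deck fixing makes poker-style base rates stop applying.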

DeepSeek hasn't been SotA in at least 12 calendar months, which might as well be a decade in LLM years.

What about Kimi and GLM?

These are well behind the general state of the art (a year or so), though they're arguably the best openly available models.

Idk man, in my tests GLM-5 matches Opus 4.5, which is what, two months old?

According to the Artificial Analysis ranking, GLM-5 is #4, behind Claude Opus 4.5, GPT-5.2-xhigh, and Claude Opus 4.6.

But... there's DeepSeek V3.2 in your link (rank 7).

How does it do on Gold Stake?

> I don't think there are many people who posted their Balatro playthroughs in text form online

There is *tons* of Balatro content on YouTube, though, and there is absolutely zero doubt that Google is using YouTube content to train its models.

Yeah, or just the Steam text guides would be a huge advantage.

I really doubt it's playing completely blind

> Most (probably >99.9%) players can't do that at the first attempt

Eh, both my partner and I did this. To be fair, we weren’t going in completely blind, and my partner hit a Legendary joker, but I think you might be slightly overstating the difficulty. I’m still impressed that Gemini did it.
