One LLM played Factorio, albeit very poorly, which you can see if you slow the video to 0.25x playback speed and pause frequently.
https://old.reddit.com/r/factorio/comments/1u1blr6/claude_fa...
There have been streams of other games, where LLMs and AIs have likewise performed very poorly.
I recognize that LLMs might be better at language processing than at these sorts of tasks. But being able to play video games is part of general capability. And this kind of hardcore video game playing, with no access to game state, is also a general task where feigning skill is harder. If LLMs excel at pretending to be competent without actually being competent, which is arguably what this AI training approach is about:
https://en.wikipedia.org/wiki/Generative_adversarial_network
Then some AIs might be trained and designed to deceive humans instead of actually being competent and capable. One response is to meet them with more difficult tests.
Basically, make tests that AIs or LLMs will not have an easy time cheating on. Hopefully, that will encourage research into greater LLM/AI competence rather than greater ability to cheat or deceive, whether on the part of LLM/AI researchers and companies or of the LLMs/AIs themselves.
> I love how it only manages to beat the game because it leveled up its Charizard to level 78. Effectively making it stronger than anything else in the main campaign. Everyone else was just filler to revive it.
> There’s a reason this is timelapsed - if you slow it down to .25x speed you’ll see it getting lost in the safari zone lol
> Deeply funny how this timeskip cuts out the 50 hours it spent grinding its shitty charmander to level 22 before Brock, skips from nugget bridge to rocket hideout, skips straight to Champion from Giovanni...really picking and choosing what to show, hey
Some comments mention that it is using strategies that young children use, like mindlessly grinding and then winning with an overpowered Pokemon. That also suggests that Pokemon, at least in some versions, is a game series with mostly fake difficulty (fraudulent game design). But it is still impressive that it could get that far with only the game's visual output to go on, since Pokemon is a fairly complex domain, even if its world positioning is tile-based.
> For those who don't know, Claude was struggling to beat Brock one year ago in Pokemon Blue. That's considerable improvement
> @techytails18 it is impressive though it's able to finally beat the game. This kind of feels like an "answer by accident" type scenario though. I'm sure six months or a year it's probably going to be speed running it though. Doing this with no harnesses impressive.
https://world.org/blog/foundational-topics/thesimpleplan
> 1. Build a private proof of human
> 2. Launch and bootstrap the network through token ownership
> 3. Reach critical scale and initial utility
> 4. Scale further through utility and decentralize
> 5. Reach global scale and help ensure AGI benefits every human