In the repo, I even have a tournament script that calculates Elo ratings. So far, Codex has been unmatched. I'll try with Opus 4.8 too.
https://egeozcan.github.io/unnamed_rts/game/
https://github.com/egeozcan/unnamed_rts/blob/main/src/script...
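For anyone curious what an Elo tournament boils down to, here is a minimal sketch of the standard Elo update. This is not the script from the repo: the K-factor of 32, the 1500 starting rating, the round-robin pairing and the names `expected_score`, `update_ratings` and `run_round_robin` are all assumptions made for illustration.

```python
from itertools import combinations

K_FACTOR = 32.0         # assumed; the real script may use a different value
START_RATING = 1500.0   # assumed starting rating for every player


def expected_score(rating_a, rating_b):
    """Standard Elo expectation: probability-like score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a, rating_b, score_a, k=K_FACTOR):
    """Return new (A, B) ratings. score_a is 1.0 win, 0.5 draw, 0.0 loss."""
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # B's change is the mirror image of A's, so the rating sum is preserved.
    return rating_a + delta, rating_b - delta


def run_round_robin(players, play_match):
    """Play every pair once. play_match(a, b) is a hypothetical callback
    that returns A's score (1.0, 0.5 or 0.0)."""
    ratings = {name: START_RATING for name in players}
    for a, b in combinations(players, 2):
        score_a = play_match(a, b)
        ratings[a], ratings[b] = update_ratings(ratings[a], ratings[b], score_a)
    return ratings
```

With two equally rated players, a win moves the winner up by half of K and the loser down by the same amount.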
Not sure why it did that. Its own rationale (which is highly suspect, but the only lead I have) is that it defaults to a dense style if it has to write a file in a single go. There may be a kernel of truth somewhere in there.
It looked gross and minified. The result was awesome, but the code looked pretty awful visually.
I have a static server of my own, so here's my list (of all the tests I published so far): https://senko.net/vibecode-bench/
Minesweeper: Create a beautiful and fully functional Minesweeper clone in HTML/JS/CSS (all in one file).
RTS: Create a simple but functional real time strategy (RTS) game similar to old WarCraft, StarCraft or Command & Conquer games. The player should be able to build buildings, create units, gather resources and should uncover the whole map. No AI or multiplayer needed. Use simple but nice-looking graphics. No sound. Implement everything in HTML/CSS/JS, everything in a single file (you can use 3rd-party js or css libraries/frameworks via CDN).
There are too many confounding variables here, randomness being just one of them. So I don't think of it as a definitive test (or a reliable ordering), just another data point (along with actual benchmarks, pelicans, etc.) to get a sense of the capabilities.
For example, I managed to get something out of DeepSeek 4 Flash quantized to 2-bit with Antirez' DwarfStar, used via Pi. Almost kinda worked! :) Which makes me optimistic for using local models for serious development soon - I'd say within a year.
It's a vocab building game, playable here (desktop only): https://rupertlinacre.com/vocab_annihilation/
It kind of blows my mind that I can go from 'I want a fun way to help him learn vocabulary, and I loved Total Annihilation as a kid' to 'here's a game that he finds genuinely fun and that helps him learn something' in a few prompts.
After some interrogation, here's how it organized the work:
1. Design workflow (rts-game-design, 11 agents, ~13 min) ran first, produced SPEC.md + DESIGN.md:
1.1 Proposals (3 parallel agents): each designed a complete RTS from a different philosophy.
1.2 Judge (1 agent): evaluated all three and synthesized one unified design, committing to specific numbers (costs, HP, map size, etc.).
1.3 Deep-dives (6 parallel agents): each wrote an implementation-ready spec for one subsystem, all consistent with the chosen design.
1.4 Synthesis (1 agent): merged the design + all six subsystem specs into one conflict-free master spec.
2. Code-review workflow (rts-code-review, 25 agents, ~5 min), ran after the main agent had written and tested the code:
2.1 Review (6 agents, read-only Explore type): each scrutinized one dimension and returned structured findings.
2.2 Verify (19 agents): every finding got its own skeptic agent told to try to refute it. Result: 19 flagged → 16 confirmed, 3 rejected as non-bugs.
What the main agent did in the main loop:
- Wrote all ~2,400 lines of index.html by hand from the spec.
- All browser testing/debugging via headless Chrome (I told it to use rodney by @simonw, love the tool :)
- Applied all 16 fixes from the review and re-verified them in the browser.
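The review-then-verify shape of step 2 can be sketched roughly like this. It is only an illustration of the fan-out pattern, not how the harness actually works: `review_fn` and `verify_fn` are hypothetical stand-ins for "spawn a reviewer agent" and "spawn a skeptic agent", and the `Finding` type is made up.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    dimension: str     # which review dimension raised it
    description: str   # what the reviewer claims is wrong


def review_and_verify(dimensions, review_fn, verify_fn, max_workers=8):
    """Fan out one reviewer per dimension, then one skeptic per finding.

    review_fn(dimension) -> list[Finding]   (hypothetical reviewer agent)
    verify_fn(finding)   -> bool            (hypothetical skeptic agent;
                                             True means it could not refute it)
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Phase 1: reviewers run in parallel, each on a single dimension.
        batches = list(pool.map(review_fn, dimensions))
        findings = [f for batch in batches for f in batch]
        # Phase 2: every finding gets its own independent skeptic.
        verdicts = list(pool.map(verify_fn, findings))
    confirmed = [f for f, ok in zip(findings, verdicts) if ok]
    rejected = [f for f, ok in zip(findings, verdicts) if not ok]
    return confirmed, rejected
```

Only the confirmed list would go back to the main agent for fixing. Giving each finding its own skeptic keeps one bad verdict from contaminating the others.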
If you can stand a little AI expansion - here are a few points Gemini came up with - I think the idea has some merit:
https://g.co/gemini/share/b5b97867eeb1
(Maybe the better analogy is roulette vs pinball machine)
I don't think the Rube Goldberg analogy works if the agentic meandering is essential complexity required to get at the results. Rube Goldberging it would be something like putting this loop inside some comically overengineered enterprise microservice web, which is then found to be running inside a Windows 98 emulator or what have you.
Yes, there is: write the code yourself.
So no extra guidance beyond the prompt.
I do find it interesting that the visual style is pretty similar to things it's produced for me.
But I just vibe-coded a handy list of all the tests I did (unfortunately without the commentary I usually leave in social media posts -- I should add those at some point): https://senko.net/vibecode-bench/
Between the two, Opus 4.8 seems more capable. But I suspect the harness plays a large role here. It's possible the result would be as good if Codex ran 10+ agents and spent an hour on it.
OpenAI and Anthropic usually fast-follow each other, so I wouldn't be surprised if Codex got the same capability in a couple of days (and even an update to the model). Then it'll be a better test.
Sooo, let's say, winging it, vibes-based: 85% for Opus 4.8, 75% for GPT 5.5. Compare with GPT 5.3 (let's say 25%) here: https://senko.net/vibecode-bench/2026/rts-codex-5.3.html
It looks quite impressive. I don't use Claude currently, but I'm hearing good things about it... from Codex users, ironically.