The LLMs I have tested[1] have terrible world models and intuitions for how actions change the environment. They're also not great at discerning and pursuing the right goals. They're like an infinitely patient five-year-old with an amazing vocabulary.
[1]: https://entropicthoughts.com/updated-llm-benchmark
(more descriptions available in earlier evaluations referenced from there)
If you're doing an RPG, which I guess is where this is most obvious, you track the player and enemy positions, their health, their moods and perhaps top thoughts, and the state of important inanimate objects. If you break down the door, you update the door's state in the document. This is in contrast to just giving the LLM the previous turns and hoping it realizes later that the door is broken down (just by statistical completion).
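A minimal sketch of that state-document idea in Python (all names and structure here are hypothetical, not from any particular framework):

    import json

    # Authoritative world document; the model never has to infer state
    # from raw turn history.
    world_state = {
        "player": {"position": "hallway", "health": 17, "mood": "wary"},
        "enemies": [{"name": "goblin", "position": "cellar", "health": 9}],
        "objects": {"cellar_door": {"state": "intact"}},
    }

    def apply_action(state, action):
        """Update the document when an action resolves."""
        if action == "break_down_cellar_door":
            state["objects"]["cellar_door"]["state"] = "broken"
            state["player"]["position"] = "cellar"
        return state

    def build_prompt(state, player_input):
        """Hand the model the current state document, not just the transcript."""
        return (
            "World state (authoritative):\n"
            + json.dumps(state, indent=2)
            + f"\n\nPlayer: {player_input}\nNarrate the result."
        )

    apply_action(world_state, "break_down_cellar_door")
    print(build_prompt(world_state, "I look around the cellar."))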
It might be too expensive, but I would be interested in the benchmarks for the current crop of SOTA models.
Which community are we talking about? The professionals with 10+ years of experience using LLMs, the vibe coders who have no experience writing code, or everyone in between? If you read some of the online communities, the experiences with the models are all over the place: some compare GPT 5.5 to the second coming of JC while others think it's stupider than 5.4.
I personally don't have time to build a set of private benchmarks to compare the models that are coming out so I'm mostly relying on private and semi-private benchmarks to get a feel for how models are improving before I subscribe to a service and start using it myself. At least it's something a bit more reliable than the vibes of random people and bots on reddit.
These benchmarks are always greenfield, but people want a model that can deal with a rotted context.
The only real way to evaluate a model is to test it yourself but that's exhausting for each new model and not comprehensive anyway.
I also find it increasingly difficult to evaluate the models I actually do use. Sometimes each new release seems identical or only marginally better than the previous version, but when I then go back two or three versions, I suddenly find that older model to be dramatically worse. But was that older model always of that quality, or am I now being served a different model under the same version name?
It's all just so opaque.
Regarding evaluation, I've found that tools like promptfoo (and in some cases custom tools built on top of it) are useful. These help when evaluating new models/versions and when modifying the system prompt to guide the model, especially if you can define visualizations and assertions to accurately test what you are trying to achieve.
This can be difficult for tasks like summarization, code generation, or creative writing that don't have clear answers. Even so, having some basic evaluation metrics and test cases can still be useful, as can being able to easily do side-by-side comparisons by hand.
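As a toy illustration in the spirit of the assertion-based approach tools like promptfoo take (this is not promptfoo's actual API; call_model is a placeholder for whatever client you use, and the test cases are made up):

    import re

    def call_model(model: str, prompt: str) -> str:
        raise NotImplementedError("plug in your model client here")

    TEST_CASES = [
        # Each case pairs a prompt with assertions over the output.
        {"prompt": "Summarize in one sentence: The guards broke down the door.",
         "assertions": [lambda out: "door" in out.lower(),
                        lambda out: len(out.split()) < 40]},
        {"prompt": "Give the date 'the fourth of July, 2020' as YYYY-MM-DD.",
         "assertions": [lambda out: re.search(r"2020-07-04", out) is not None]},
    ]

    def run_suite(model: str) -> float:
        """Fraction of cases where every assertion holds."""
        passed = 0
        for case in TEST_CASES:
            out = call_model(model, case["prompt"])
            if all(check(out) for check in case["assertions"]):
                passed += 1
        return passed / len(TEST_CASES)

    # Side-by-side comparison across versions:
    # for m in ("model-a", "model-b"):
    #     print(m, run_suite(m))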
What you actually want to measure is how these models handle what they actually SEE in production: context shape, retrieval quality, tool use, the ability to compose state across turns. None of that is in SWE-bench, because SWE-bench is shaped like a one-shot problem set and frontier coding work isn't shaped like that anymore.
Even a perfectly contamination-free benchmark would mostly test the wrong axis. The model is already at human-grad-student level on isolated problems. The leverage is in how it operates inside a larger system, and that's almost a taste/preference issue, virtually impossible to measure objectively.
Obligatory XKCD: https://xkcd.com/937/
Use frontier LLMs to help create the harness and identify problems, but put in the effort to ensure your verifier is actually good and robust.
Then you have your own private benchmark, which makes evaluating new model releases a breeze instead of relying on pure vibes or contaminated public benchmarks.
For extra props, add things you care about, such as reliability (e.g. deliberate noise injection, introducing simple typos into problems, problem variants, running each test multiple times).
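A rough Python sketch of the noise-injection part, assuming your harness already provides a run_case/verify pair (both hypothetical here):

    import random

    def inject_typos(text: str, rate: float = 0.02, seed: int = 0) -> str:
        """Randomly swap adjacent characters to simulate noisy input."""
        rng = random.Random(seed)
        chars = list(text)
        for i in range(len(chars) - 1):
            if rng.random() < rate:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)

    def reliability(problem, expected, run_case, verify, trials=5):
        """Fraction of noisy variants the model still gets right."""
        wins = 0
        for seed in range(trials):
            noisy = inject_typos(problem, seed=seed)
            wins += int(verify(run_case(noisy), expected))
        return wins / trials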
At the end of the day, however, the best LLM is the one you're the most productive with. Frontier intelligence might be the main factor, but it's far from the only one:
• How fast is it in the real world? How well does it understand your general style of prompting / guidance?
• How consistent and reliable is it? Does it exhibit laziness, or hallucinate having performed actions (while claiming it did) that it never performed?
• etc.
By determining whether a model gets better or not on a given benchmark, OpenAI selects models against those benchmarks, implicitly using them in training.
In the end all it does is affirm what you're saying though. Benchmarks are essentially obsolete the moment they become recognized. I suppose it's just another iteration of Goodhart's Law.
As long as there's a test framework, you can gauge success deterministically.
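For instance, a minimal Python sketch of that idea: write the model's output to disk, run the test suite, and treat the exit code as the verdict (the file name and test layout are assumptions):

    import pathlib
    import subprocess

    def passes_tests(generated_code: str, test_dir: str = "tests/") -> bool:
        # Assumes the tests import from solution.py.
        pathlib.Path("solution.py").write_text(generated_code)
        result = subprocess.run(["pytest", test_dir, "-q"], capture_output=True)
        return result.returncode == 0  # pytest exits 0 when every test passes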