The LLMs I have tested[1] have terrible world models and intuitions for how actions change the environment. They're also not great at discerning and pursuing the right goals. They're like an infinitely patient five-year-old with an amazing vocabulary.
[1]: https://entropicthoughts.com/updated-llm-benchmark
(more descriptions available in earlier evaluations referenced from there)
If you're doing an RPG, which I guess is where this is most obvious, you track the player and enemy positions, their health, their moods and perhaps top thoughts, and the state of important inanimate objects. If you break down the door, you update the door's state in the document. This is in contrast to just giving the LLM the previous turns and hoping it realizes later that the door is broken down (just by statistical completion).
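A minimal sketch of that state-document idea in Python (all names and structure here are hypothetical, not from any particular framework):

    import json

    # Authoritative world document; the model never has to infer state
    # from raw turn history.
    world_state = {
        "player": {"position": "hallway", "health": 17, "mood": "wary"},
        "enemies": [{"name": "goblin", "position": "cellar", "health": 9}],
        "objects": {"cellar_door": {"state": "intact"}},
    }

    def apply_action(state, action):
        """Update the document when an action resolves."""
        if action == "break_down_cellar_door":
            state["objects"]["cellar_door"]["state"] = "broken"
            state["player"]["position"] = "cellar"
        return state

    def build_prompt(state, player_input):
        """Hand the model the current state document, not just the transcript."""
        return (
            "World state (authoritative):\n"
            + json.dumps(state, indent=2)
            + f"\n\nPlayer: {player_input}\nNarrate the result."
        )

    apply_action(world_state, "break_down_cellar_door")
    print(build_prompt(world_state, "I look around the cellar."))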
It might be too expensive, but I would be interested in the benchmarks for the current crop of SOTA models.
Which community are we talking about? The professionals with 10+ years of experience using LLMs, the vibe coders who have no experience writing code, or everyone in between? If you read some of the online communities, the experiences with the models are all over the place: some compare GPT 5.5 to the second coming of JC while others think it's stupider than 5.4.
I personally don't have time to build a set of private benchmarks to compare the models that are coming out so I'm mostly relying on private and semi-private benchmarks to get a feel for how models are improving before I subscribe to a service and start using it myself. At least it's something a bit more reliable than the vibes of random people and bots on reddit.
These benchmarks are always greenfield, but people want a model that can deal with a rotted context.
The only real way to evaluate a model is to test it yourself but that's exhausting for each new model and not comprehensive anyway.
I also find it increasingly difficult to evaluate the models I actually do use. Sometimes each new release seems identical or only marginally better than the previous version, but when I then go back two or three versions, I suddenly find that older model to be dramatically worse. But was that older model always of that quality, or am I now being served a different model under the same version name?
It's all just so opaque.
Regarding evaluation, I've found that tools like promptfoo (and in some cases custom tools built on top of it) are useful. These help when evaluating new models/versions and when modifying the system prompt to guide the model, especially if you can define visualizations and assertions to accurately test what you are trying to achieve.
This can be difficult for tasks like summarization, code generation, or creative writing that don't have clear answers. Even so, having some basic evaluation metrics and test cases can still be useful, as can being able to easily do side-by-side comparisons by hand.
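As a toy illustration in the spirit of the assertion-based approach tools like promptfoo take (this is not promptfoo's actual API; call_model is a placeholder for whatever client you use, and the test cases are made up):

    import re

    def call_model(model: str, prompt: str) -> str:
        raise NotImplementedError("plug in your model client here")

    TEST_CASES = [
        # Each case pairs a prompt with assertions over the output.
        {"prompt": "Summarize in one sentence: The guards broke down the door.",
         "assertions": [lambda out: "door" in out.lower(),
                        lambda out: len(out.split()) < 40]},
        {"prompt": "Give the date 'the fourth of July, 2020' as YYYY-MM-DD.",
         "assertions": [lambda out: re.search(r"2020-07-04", out) is not None]},
    ]

    def run_suite(model: str) -> float:
        """Fraction of cases where every assertion holds."""
        passed = 0
        for case in TEST_CASES:
            out = call_model(model, case["prompt"])
            if all(check(out) for check in case["assertions"]):
                passed += 1
        return passed / len(TEST_CASES)

    # Side-by-side comparison across versions:
    # for m in ("model-a", "model-b"):
    #     print(m, run_suite(m))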
What you actually want to measure is how these models handle what they actually SEE in production: context shape, retrieval quality, tool use, the ability to compose state across turns. None of that is in SWE-bench, because SWE-bench is shaped like a one-shot problem set and frontier coding work isn't shaped like that anymore.
Even a perfectly contamination-free benchmark would mostly test the wrong axis. The model is already at human-grad-student level on isolated problems. The leverage is in how it operates inside a larger system, and that's almost a taste/preference issue, virtually impossible to measure objectively.
Obligatory XKCD: https://xkcd.com/937/
Use frontier LLMs to help create the harness and identify problems, but put in the effort to ensure your verifier is actually good and robust.
Then you have your own private benchmark, which makes evaluating new model releases a breeze instead of relying on pure vibes or contaminated public benchmarks.
For extra props, add things you care about, such as reliability (e.g. deliberate noise injection, introducing simple typos into problems, problem variants, running each test multiple times).
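A rough Python sketch of the noise-injection part, assuming your harness already provides a run_case/verify pair (both hypothetical here):

    import random

    def inject_typos(text: str, rate: float = 0.02, seed: int = 0) -> str:
        """Randomly swap adjacent characters to simulate noisy input."""
        rng = random.Random(seed)
        chars = list(text)
        for i in range(len(chars) - 1):
            if rng.random() < rate:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)

    def reliability(problem, expected, run_case, verify, trials=5):
        """Fraction of noisy variants the model still gets right."""
        wins = 0
        for seed in range(trials):
            noisy = inject_typos(problem, seed=seed)
            wins += int(verify(run_case(noisy), expected))
        return wins / trials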
At the end of the day, however, the best LLM is the one you're the most productive with. Frontier intelligence might be the main factor, but it's far from the only one:
• How fast is it in the real world? How well does it understand your general style of prompting / guidance?
• How consistent and reliable is it? Does it exhibit laziness, or hallucinate having performed actions (while claiming it did) that it never performed?
• etc.
By determining whether a model gets better or not on a given benchmark, OpenAI selects models against those benchmarks, implicitly using them in training.
In the end all it does is affirm what you're saying though. Benchmarks are essentially obsolete the moment they become recognized. I suppose it's just another iteration of Goodhart's Law.
As long as there's a test framework, you can gauge success deterministically.
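For instance, a minimal Python sketch of that idea: write the model's output to disk, run the test suite, and treat the exit code as the verdict (the file name and test layout are assumptions):

    import pathlib
    import subprocess

    def passes_tests(generated_code: str, test_dir: str = "tests/") -> bool:
        # Assumes the tests import from solution.py.
        pathlib.Path("solution.py").write_text(generated_code)
        result = subprocess.run(["pytest", test_dir, "-q"], capture_output=True)
        return result.returncode == 0  # pytest exits 0 when every test passes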