There are different levels of "cheating" on benchmarks. The worst would be literally putting them in the loss function during RL; I assume the major labs are not cheating at that level. And I am sure they are making a genuine effort to keep the benchmark content out of the training data.
But, ultimately it seems implausible that they completely abstain from benchmarking their model until they are about to release it. Even if they did do that, the benchmark is still ultimately a part of the outermost feedback loop. So these models are all, to _some_ degree, benchmark-solving machines.
I think all we can really do is live with the model for a while and develop a subjective feeling about its quality. This shouldn't be surprising: nobody believes that coding interviews work, and we all know that you just have to work with someone to figure out if they're a good programmer. As AIs become more human-like, it's natural they should get harder to evaluate.
This is a bit awkward: it puts us in quite a weak position as consumers.
Maybe to some extent you can get a meaningful signal from sentiments on HN etc, but:
- There must be some amount of manipulation of this going on
- Even if it was fully organic, it's highly likely that your experience will differ materially from the median online nerd, because AIs are bizarre things that respond in unpredictable ways to intangible things.
I built it for myself, to test which models to use via OpenRouter for my n8n agents. Currently actually still using gpt-5.3-codex for many things, as its pricing is really good in production (due to how their token caching works).
Gemini models still have the best intelligence (when asked any questions, most likely to get it right), but in production they still have many failure modes[1].
[0]: https://aibenchy.com
It does things like place the old Gemini 3 Flash above the more capable 3.5 Flash and Opus 4.5 - Opus 4.8 and gpt-5.5
At least, until hopefully one day HN has a rule about accounts that derive 99.9999% of their engagement with the site from shilling a personal project.
I found it while trying to use 3.5 Flash for scoring the reasoning of some models, and it gets it wrong because of the centering bias, whereas 3 Flash gets scoring right.
How is it useless to see that Opus 4.8 is 2x more expensive and 2x slower on some questions?
There are many benchmarks, all for specific use cases, but with them the differences seem to come down to a single point at the top end (93% vs 92%)
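To put a rough number on why a 93% vs 92% gap is hard to read, here is a minimal sketch that treats each benchmark question as an independent pass/fail trial and compares the gap to its binomial standard error. The function names and the question counts (500 and 10,000) are hypothetical, picked only for illustration, and the independence assumption ignores that both models usually answer the same questions, so treat it as a back-of-the-envelope check rather than a proper significance test.

```python
import math


def score_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def gap_in_standard_errors(acc_a: float, acc_b: float, n_questions: int) -> float:
    """Size of the gap between two accuracies, in units of the standard
    error of their difference.

    Assumption: the two scores are treated as independent samples of the
    same size, which is a simplification.
    """
    se_diff = math.sqrt(
        score_standard_error(acc_a, n_questions) ** 2
        + score_standard_error(acc_b, n_questions) ** 2
    )
    return abs(acc_a - acc_b) / se_diff


if __name__ == "__main__":
    # Hypothetical benchmark sizes, not taken from any real benchmark.
    for n in (500, 10_000):
        gap = gap_in_standard_errors(0.93, 0.92, n)
        print(f"n={n}: gap is {gap:.2f} standard errors")
```

With a few hundred questions the one-point gap comes out at well under one standard error, so it is indistinguishable from noise under these assumptions; it only starts to look meaningful once the benchmark has thousands of questions.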
I think that tracks, but still, it was refreshing to see a benchmark that can help me form better opinions.
Surprised about Mimo v2.5: within artificial-analysis and other benchmarks, the difference between Mimo and Deepseek seems marginal, yet a lot of the focus/(hype?) is on Deepseek
But mimo seems like an interesting model and they are having some crazy discounts too.
Deepseek is valuable for the research community because of how open they are, but it's absolutely crazy to think how Xiaomi basically pulled up and created Mimo, given that they didn't have anything until quite recently.
Either way, an interesting benchmark, also a plus point for giving golang some decent representation equal to python/typescript.
I think there are sets of tasks resembling normal benchmarks where open source models can be absolutely fine, and for a very small fraction of tasks, or for more technical things, the benchmark that you linked starts to be more representative. So it depends on the scale of complexity, but it's good to see how models compete given enough complexity. Definitely fascinating.
I would be interested to see more models compete on this test. The current range is still a bit limited compared to other benchmarks, but OSS models like Kimi/Mimo seem to be only 3-4 months (6 at most) behind closed source.
The recent hype around Deepseek is a combination of existing name recognition and incredibly low pricing. Their v4 models, both pro and flash, are incredible for their price. That's more revolutionary than Mimo, which is multiple times more expensive, just like Kimi 2.6.
Of the metrics they reported for 4.7, for 4.8 they excluded BrowseComp, CharXiv Reasoning, CyberGym, GPQA Diamond, MCP Atlas, MMMLU, SWE-bench Verified. The last 4 were almost always mentioned in previous Opus releases.
I doubt Anthropic internally sets as a goal to improve this or that benchmark - it's just a way to visualize progress. They probably have much more complex metrics internally.