Right, I'm combining my own observations with what the leaderboard is showing. Could be confirmation bias, but I use both Opus and GPT extensively, and since GPT 5.4 I have noticed that Opus doesn't come close to GPT's level of technical depth. I was hoping Opus 4.7 would close that gap, but unfortunately it doesn't compare to GPT 5.4 in that respect.

I'm not being a hater. I love Opus for other reasons, but I can't rely on it for technical work.

Decision making refers to the environments where the LLM is called on every tick (like games with social communication). Examples here: https://gertlabs.com/spectate

GPT 5.5 just launched, and those games take longer to accumulate data, so it doesn't have enough samples yet. It will end up with a wider lead on Opus, I am sure. Coding evals always have large sample sizes on day 1. Good find; we should probably adjust the weighting better for decision games with low match counts.
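
To make the low-match-count idea concrete, here is a rough sketch of one common way to do it: shrink a category score toward a prior until enough matches back it up. This is not how the leaderboard works today. The function name `shrunk_score`, the `prior_score` value, and the `prior_strength` of 30 pseudo-matches are all made up for illustration.

```python
def shrunk_score(raw_score: float, matches: int, prior_score: float,
                 prior_strength: float = 30.0) -> float:
    """Pull raw_score toward prior_score when few matches support it.

    Hypothetical helper: prior_strength acts like a number of
    pseudo-matches played at prior_score (an assumed tuning knob).
    """
    if matches < 0:
        raise ValueError("matches must be non-negative")
    if prior_strength <= 0:
        raise ValueError("prior_strength must be positive")
    # Weighted average of observed score and prior, weighted by sample size.
    total_weight = matches + prior_strength
    return (matches * raw_score + prior_strength * prior_score) / total_weight


if __name__ == "__main__":
    # A new model with few decision-game matches stays near the prior;
    # one with many matches keeps almost all of its raw score.
    for n in (0, 30, 3000):
        print(n, shrunk_score(raw_score=80.0, matches=n, prior_score=50.0))
```

With this kind of adjustment, a model that just launched would not swing the overall ranking on a handful of decision games, and the adjustment fades out on its own as matches accumulate.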
